I have a long data frame that I want to widen, but one key has two different values:
df <- data.frame(ColA=c("A", "B", "B", "C"), ColB=letters[23:26])
ColA ColB
1 A w
2 B x
3 B y
4 C z
I want my output to be a paste of the two values for this key together:
ColA ColB
1 A w
2 B xy
3 C z
A regular pivot_wider() will throw a warning and convert the values to lists:
df.wide <- df %>%
pivot_wider(names_from=ColA, values_from=ColB)
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
# A tibble: 1 x 3
A B C
<list> <list> <list>
1 <chr [1]> <chr [2]> <chr [1]>
Based on the warning, it looks like pivot_wider() with a values_fn argument is close to what I want as an intermediate step:
# intermediate step
df.wide <- df %>%
pivot_wider(names_from=ColA, values_from=ColB, values_fn=SOMETHING)
A B C
1 w xy z
But it looks like values_fn only takes summary functions, and not something that would work on character data (like paste()).
The closest I can get is:
df %>%
pivot_wider(names_from=ColA, values_from=ColB, values_fn=list) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols=everything(), names_to="ColA", values_to="ColB")
# A tibble: 3 x 2
ColA ColB
<chr> <chr>
1 A "w"
2 B "c(\"x\", \"y\")"
3 C "z"
That would still need an additional gsub()-type mutate step to clean up the strings. Surely there's an easier way! Preferably within the tidyverse, but I'm also open to other packages.
Thanks
I don't think you need to pivot here, unless your real data is more complicated than the example shown.
library(dplyr)
df %>%
group_by(ColA) %>%
summarise(ColB = paste0(ColB, collapse = ""))
Result:
# A tibble: 3 × 2
ColA ColB
<chr> <chr>
1 A w
2 B xy
3 C z
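If the pivot is still wanted as the intermediate step, values_fn is not limited to numeric summaries. The sketch below assumes a reasonably recent tidyr in which values_fn accepts a single function; the names df_wide and df_long are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ColA = c("A", "B", "B", "C"), ColB = letters[23:26])

# values_fn is applied to the vector of values that land in each cell,
# so any function returning a single value works, including paste()
df_wide <- df %>%
  pivot_wider(
    names_from = ColA,
    values_from = ColB,
    values_fn = function(x) paste(x, collapse = "")
  )

# Reshape back to the two-column layout shown in the question
df_long <- df_wide %>%
  pivot_longer(cols = everything(), names_to = "ColA", values_to = "ColB")
```

Here df_wide holds one row with columns A, B and C, and df_long has one row per key again.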
I ran into an annoying issue earlier today with a dataframe I had been given that has hundreds of columns. I was trying to select rows from it using a list I had created in a separate process. When I filtered using the list, I got an empty dataframe back. After struggling with this for a while, I realized that the dataframe also had a column with the same name as my list, and that filter() was giving the column priority.
My question is, is there a better way I should be filtering from dataframes rather than the way I am currently? I do not like that it is ambiguous if a column or a list is used. Here is a minimum example to show this:
Consider a dataframe which has two columns, a and b:
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
# A tibble: 3 × 2
# a b
# <chr> <chr>
# 1 first 2
# 2 second 3
# 3 third 4
I would then like to select rows from this dataframe using a list of values I created using a different process. Notice that the first list is named b, which is also the name of one of the columns in the df.
b = c("first")
d = c("first")
These two commands are almost the same, except that the first filters based on the column (and therefore returns nothing) and the second filters based on the list (and therefore returns the first row):
# Returns Nothing:
df %>%
filter(a %in% b)
# # A tibble: 0 × 2
# … with 2 variables: a <chr>, b <chr>
# ℹ Use `colnames()` to see all variable names
# Returns Desired Row(s)
df %>%
filter(a %in% d)
# A tibble: 1 × 2
# a b
# <chr> <chr>
# 1 first 2
Is there a better way to filter which is less ambiguous? I guess I would like an error or something like that. I realize this is kind of an edge case.
You can use the .data$ and .env$ pronouns from rlang to distinguish between a variable in the data set and an object in the environment.
df %>%
filter(a %in% .env$b)
# A tibble: 1 × 2
# a b
# <chr> <chr>
# 1 first 2
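To remove the ambiguity on both sides of the comparison, both pronouns can be spelled out. This is a minimal sketch using the data from the question; the result name explicit is only illustrative.

```r
library(dplyr)

df <- tibble(a = c("first", "second", "third"),
             b = c("2", "3", "4"))
b <- c("first")

# .data$a always refers to the column a of df;
# .env$b always refers to the vector b in the calling environment,
# even though df also has a column named b
explicit <- df %>%
  filter(.data$a %in% .env$b)
```

Unlike the bare filter(a %in% b), this does not depend on which names happen to exist as columns, and .env$ raises an error if the object is missing from the environment.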
You can use !! to evaluate the vector b, rather than use the variable b from the dataset. It also works with vectors that are not also variable names in the data, like d. So, if you expect this to happen a lot, you could always prefix the vector of values you're filtering on with !! and you won't run into this problem.
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
b = c("first")
d = c("first")
df %>%
filter(a %in% !!b)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
df %>%
filter(a %in% !!d)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
Created on 2023-02-10 by the reprex package (v2.0.1)
I would like to be able to create a list of variables, and then perform a count of a target variable for each level of each variable in the list. For clear presentation of the results, I'd like my end result to take the form of four columns: Variable, Level, Result, and Count.
Consider this partially-there example, borrowing heavily from Brad Cannel's answer at dplyr- group by in a for loop r:
df <- tibble(
var1 = c(rep("a", 5), rep("b", 5)),
var2 = c(rep("c", 3), rep("d", 7)),
var3 = rnorm(10),
result=c("good","bad","good","bad","good","bad","good","bad","good","bad")
)
groups <- c(quo(var1), quo(var2)) # Create a list of quosures
results<-list()
for (i in seq_along(groups)) {
results[[i]]<-df %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
count(result)
}
all_results<-bind_rows(results)
At this point, the column n has the counts that I'd like. Rather than having columns named var1 and var2, I'm hoping to produce a result that looks like:
desired_results<-tibble(
variable=c("var1","var1","var1","var1","var2","var2","var2","var2"),
level=c("a","a","b","b","c","c","d","d"),
result=c("bad","good","bad","good","bad","good","bad","good"),
n=c(2,3,3,2,1,2,4,3)
)
I have tried using mutate in the loop to produce my result, but can't get the formatting correct:
for (i in seq_along(groups)) {
results[[i]]<-df %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
count(result)%>%
mutate(level=!!groups[[i]])%>%
mutate(variable=groups[i])%>%
ungroup()%>%
select(variable,level,result,n)
}
I figured out how to "get there" using pivot_longer, like so (albeit just needing to rename columns afterwards):
all_results2<-all_results%>%
pivot_longer(cols=c(-result,-n))%>%
filter(!(is.na(value)))
I'd really like to know how I could avoid this and just produce a column that houses the variable name right there in the for loop, and I'm guessing I'm missing some key bit of syntax. Any help in finding and explaining the solution would be greatly appreciated!
This could be done with pivot_longer alone, without looping and then binding the rows:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = var1:var2, names_to = 'variable',
values_to ='level') %>%
count(variable, level, result)
-output
# A tibble: 8 x 4
variable level result n
<chr> <chr> <chr> <int>
1 var1 a bad 2
2 var1 a good 3
3 var1 b bad 3
4 var1 b good 2
5 var2 c bad 1
6 var2 c good 2
7 var2 d bad 4
8 var2 d good 3
Another option in the larger tidyverse, instead of the for loop, would be a call to purrr::map_dfr.
library(tidyverse)
groups <- c("var1", "var2")
map_dfr(groups,
~ tibble(variable = .x,
count(df, level = !!sym(.x), result)))
#> # A tibble: 8 x 4
#> variable level result n
#> <chr> <chr> <chr> <int>
#> 1 var1 a bad 2
#> 2 var1 a good 3
#> 3 var1 b bad 3
#> 4 var1 b good 2
#> 5 var2 c bad 1
#> 6 var2 c good 2
#> 7 var2 d bad 4
#> 8 var2 d good 3
Created on 2021-07-21 by the reprex package (v0.3.0)
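For the loop itself, the piece of syntax the question is looking for may be rlang::as_label(), which turns a quosure into a string. The following is a sketch under that assumption, keeping the for loop from the question and naming the level column directly inside count().

```r
library(dplyr)

df <- tibble(
  var1 = c(rep("a", 5), rep("b", 5)),
  var2 = c(rep("c", 3), rep("d", 7)),
  var3 = rnorm(10),
  result = c("good", "bad", "good", "bad", "good",
             "bad", "good", "bad", "good", "bad")
)

groups <- rlang::quos(var1, var2)  # a list of quosures

results <- vector("list", length(groups))
for (i in seq_along(groups)) {
  results[[i]] <- df %>%
    # count by the unquoted variable, renaming it to "level" on the fly
    count(level = !!groups[[i]], result) %>%
    # as_label() gives the variable name as a string, e.g. "var1"
    mutate(variable = rlang::as_label(groups[[i]])) %>%
    select(variable, level, result, n)
}
all_results <- bind_rows(results)
```

Because every iteration produces the same four columns, bind_rows() stacks them without introducing NA-filled var1 and var2 columns.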
I have one dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. The columns shown are the columns of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) for type A because all of its values match the values, and the order of values, in Reference_A. Group 2, however, would score 0.0 for type B because its values are out of order compared to Reference_B, and 0.66 for type C because 2 of its 3 values match the values and order in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone who has provided solutions! These solutions work well for the toy dataset, but I have not yet been able to adapt them to datasets with more columns. As I wrote in my post, the columns listed above are the important ones, but I'd prefer not to drop the other columns unless necessary.
We may also do this with mget to return a list of data.frames, bind them together, and take the grouped mean of a logical vector:
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(`Reference A`, `Reference B`, `Reference C`) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
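Since the update mentions datasets with more columns, here is a sketch of a join-based variant that leaves extra columns alone. It assumes that "order" means position within each group and type, and the helper columns pos and ref_value are names introduced here for illustration only.

```r
library(dplyr)

Dataframe <- tibble(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard")
)
Reference_A <- tibble(type = "A", value = c("Teddy", "William", "Lars"))
Reference_B <- tibble(type = "B", value = c("Elsie", "Dolores"))
Reference_C <- tibble(type = "C", value = c("Maeve", "Hale", "Bernard"))

# Number the reference values within each type so that order matters
reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
  group_by(type) %>%
  mutate(pos = row_number()) %>%
  ungroup() %>%
  rename(ref_value = value)

scores <- Dataframe %>%
  group_by(group, type) %>%
  mutate(pos = row_number()) %>%   # position within each group/type
  ungroup() %>%
  left_join(reference, by = c("type", "pos")) %>%
  group_by(group, type) %>%
  # a position with no reference value counts as a mismatch
  summarise(score = mean(coalesce(value == ref_value, FALSE)),
            .groups = "drop")
```

Any other columns in Dataframe simply pass through the join; only the final summarise() reduces the result to group, type and score.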
I would like to identify all rows of a tibble that have been altered after mutate.
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for some flexibility using different kinds of messages or warnings depending on what you're checking) you could do:
output <- list()
for (i in 1:nrow(dfnew)) {
if (all(df[i, ] == dfnew[i, ])) {
next
}
output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
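The same position-by-position comparison can also be written without a loop. This is a sketch that assumes df and dfnew have identical dimensions and row order and contain no NA values (a comparison with NA would itself be NA); the names changed and altered are illustrative.

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
dfnew <- df %>%
  mutate(value = if_else(name == "A", value + 1, value),
         name  = if_else(name == "B", "K", name))

# df != dfnew is a logical matrix; a row is "altered" if any cell differs
changed <- rowSums(df != dfnew) > 0
altered <- dfnew[changed, ]
```

Because this compares rows by position, it also flags a row whose new values happen to coincide with some other row of df, which a value-based comparison would not report.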
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))
Suppose I have some count data that looks like this:
library(tidyr)
library(dplyr)
X.raw <- data.frame(
x = as.factor(c("A", "A", "A", "B", "B", "B")),
y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
z = 1:6
)
X.raw
# x y z
# 1 A i 1
# 2 A ii 2
# 3 A ii 3
# 4 B i 4
# 5 B i 5
# 6 B i 6
I'd like to tidy and summarise like this:
X.tidy <- X.raw %>% group_by(x, y) %>% summarise(count = sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
I know that for x=="B" and y=="ii" we have an observed count of zero, rather than a missing value, i.e. the field worker was actually there, but because there wasn't a positive count no row was entered into the raw data. I can add the zero count explicitly by doing this:
X.fill <- X.tidy %>% spread(y, count, fill = 0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 B i 15
# 3 A ii 5
# 4 B ii 0
But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?
Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.
Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:
X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0
This keeps a group for every combination of levels of the factor columns. It only applies to factors, so if you have character columns you might want to convert them first (thanks to Pake for the note).
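To illustrate the point about character columns, here is a sketch in which x and y start out as character vectors and are converted before grouping. The names X.chr and X.counts are made up for this example, and across() assumes dplyr 1.0 or later.

```r
library(dplyr)

# Same data as X.raw, but with character columns instead of factors
X.chr <- tibble(
  x = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "i", "i"),
  z = 1:6
)

X.counts <- X.chr %>%
  # .drop = FALSE only retains empty groups for factor columns
  mutate(across(c(x, y), as.factor)) %>%
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z), .groups = "drop")
```

The empty B/ii group gets a count of 0 because sum() over an empty vector is 0.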
The complete function from tidyr is made for just this situation.
From the docs:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data.
You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y, and filling z with 0 (you could use the default NA fill and use na.rm = TRUE in sum).
X.raw %>%
complete(x, y, fill = list(z = 0)) %>%
group_by(x,y) %>%
summarise(count = sum(z))
Source: local data frame [4 x 3]
Groups: x [?]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.
# Complete after ungrouping
X.tidy %>%
ungroup %>%
complete(x, y, fill = list(count = 0))
# Complete within grouping
X.tidy %>%
complete(y, fill = list(count = 0))
The result is the same for each option:
Source: local data frame [4 x 3]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You can use tidyr's expand to make all combinations of levels of factors, and then left_join:
X.tidy %>% expand(x, y) %>% left_join(X.tidy)
# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii NA
Then you may keep values as NAs or replace them with 0 or any other value.
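A sketch of that replacement step, using tidyr's replace_na(). It assumes the summary is ungrouped before expanding, since recent tidyr versions do not allow expanding a grouping column; X.fill is just the name used here for the result.

```r
library(dplyr)
library(tidyr)

X.raw <- data.frame(
  x = as.factor(c("A", "A", "A", "B", "B", "B")),
  y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

# Ungrouped summary, so expand() can use both x and y
X.tidy <- X.raw %>%
  group_by(x, y) %>%
  summarise(count = sum(z), .groups = "drop")

X.fill <- X.tidy %>%
  expand(x, y) %>%                        # all combinations of factor levels
  left_join(X.tidy, by = c("x", "y")) %>%
  mutate(count = replace_na(count, 0L))   # missing combinations become 0
```

Passing by explicitly also avoids the "Joining by" message shown in the output above.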
This isn't a complete solution to the problem either, but it's faster and more RAM-friendly than spread & gather.
plyr has the functionality you're looking for, but dplyr doesn't (yet), so you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop=FALSE to keep zero-count groups in the final result. For example:
library(plyr)
X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)
X.tidy
x y count
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You could explicitly build all possible combinations and then join them with the tidy summary:
X.fill <- expand.grid(x = unique(X.tidy$x), y = unique(X.tidy$y)) %>%
left_join(X.tidy, by = c("x", "y")) %>%
mutate(count = ifelse(is.na(count), 0, count)) # replace NA counts with 0
You can also use the data.table package and its Cross Join CJ() function for that.
require(data.table)
X = data.table(X.raw)[
CJ(y = y,
x = x,
unique = TRUE),
on = .(x, y)
][ , .(z = sum(z)), .(x, y) ][ order(x, y) ]
X
# filling the NAs with 0s
setnafill(X, fill = 0, cols = 'z')
X
# x y z
# 1: A i 1
# 2: A ii 5
# 3: B i 15
# 4: B ii 0
Though it's not initially asked for, I'm adding a data.table solution here for the sake of completeness and to also link to the related data.table question.