I have counted the values of a column with dplyr:
yelp_tbl %>% select(name) %>% count(name)
The resulting data looks like this:
# A tibble: 108,999 x 2
name n
<chr> <int>
1 'do blow dry bar 1
2 'Round Table Tours 1
3 'S Hundehüttle 1
4 # 1 Nails 1
5 #1 Cochran Buick GMC of Monroeville 1
6 #1 Cochran Buick GMC of Robinson 1
7 #1 Cochran Cadillac - Monroeville 2
Now I want to make a boxplot of the "n" column.
yelp_tbl %>% select(name) %>% count(name) %>% boxplot(n)
But I got this result:
Error in x[floor(d)] + x[ceiling(d)] :
non-numeric argument to binary operator
Any idea? Is it because of the function?
Pull the column out as a numeric vector and then do boxplot:
library(stringi)
df <- data.frame(name = stri_rand_strings(10000, 2, pattern = '[a-z]'))
df %>% select(name) %>% count(name) %>% pull(n) %>% boxplot()
# ^^^^^^
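Why this helps: the pipe passes the whole tibble to boxplot(), character column included, whereas pull() hands over only the numbers. A minimal sketch of the same idea on made-up data (the `toy` data frame and its values are assumptions for illustration, not the Yelp data):

```r
library(dplyr)

# Hypothetical toy data standing in for yelp_tbl
toy <- data.frame(name = c("a", "a", "b", "c", "c", "c"))

# count() returns a data frame with the columns `name` and `n`
counts <- toy %>% count(name)

# pull() extracts `n` as a plain integer vector
n_vec <- counts %>% pull(n)

# boxplot() now receives only numeric values
boxplot(n_vec, ylab = "n")
```

Writing `boxplot(counts$n)` should be equivalent if you prefer to avoid the extra pipe step.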
Try this (it's hard to know if it works without example data):
library(tidyverse)
yelp_tbl %>%
select(name) %>%
count(name) %>%
ggplot(aes(name, n)) +
geom_bar(stat = "identity", position = "dodge")
Note that boxplot() is a base R function and does not understand aes(), so pass it the counts directly, or use a different function like ggplot() + geom_boxplot().
Example:
boxplot(count(yelp_tbl, name)$n)
or
yelp_tbl %>% count(name) %>% ggplot(aes(y = n)) + geom_boxplot()
This is not a very good title for the question. I want to sum across certain columns in a data frame for each group, excluding one column for each of my groups. A simple example would be as follows:
df <- tibble(group_name = c("A", "B","C"), mean_A = c(1,2,3), mean_B = c(2,3,4), mean_C=c(3,4,5))
df %>% group_by(group_name) %>% mutate(m1 = sum(across(contains("mean"))))
This creates column m1, which is the sum across mean_A, mean_B, mean_C for each group. What I want to do is exclude mean_A for group A, mean_B for group B, and mean_C for group C. The following does not work though (not surprisingly).
df %>% group_by(group_name) %>% mutate(m1 = sum(across(c(contains("mean") & !contains(group_name)))))
Do you have an idea how I could do this? My original data contains many more groups, so would be hard to do by hand.
Edit: I have tried the following, which solves it in a rudimentary fashion, but something (grepl, maybe?) does not seem to work well here and I get the wrong result.
df %>% pivot_longer(!group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Edit2: I found out what was wrong with the above. The version below works, but it still produces a lot of warnings, so I recommend following TarJae's answer instead.
df %>% pivot_longer(!group_name) %>% group_by(group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Here is another option where you can just use group_name directly with the tidyselect helpers:
df %>%
rowwise() %>%
mutate(m1 = rowSums(select(across(starts_with("mean")), -ends_with(group_name)))) %>%
ungroup()
Output
group_name mean_A mean_B mean_C m1
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 5
2 B 2 3 4 6
3 C 3 4 5 7
How it works
The row-wise output of across is a 1-row tibble containing only the variables that start with "mean".
select then drops, from the output of across, the variables whose names end with the value of group_name.
At this point you are left with a 1 x 2 tibble, which is then summed using rowSums.
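For comparison, here is a vectorised sketch of the same calculation that avoids rowwise(): take the full row sum and subtract each row's "own" column. This swaps in base R matrix indexing rather than tidyselect, and it assumes every column is named `mean_` followed by the group name, as in the example data.

```r
library(tibble)

df <- tibble(group_name = c("A", "B", "C"),
             mean_A = c(1, 2, 3),
             mean_B = c(2, 3, 4),
             mean_C = c(3, 4, 5))

# All columns whose names start with "mean_"
mean_cols <- grep("^mean_", names(df), value = TRUE)
m <- as.matrix(df[mean_cols])

# Column position of each row's own mean column
# (NA if a group has no matching column, which would make m1 NA for that row)
own <- match(paste0("mean_", df$group_name), mean_cols)

# Row total minus the value in the row's own column
df$m1 <- rowSums(m) - m[cbind(seq_len(nrow(m)), own)]
df
```

On larger data this is likely to be faster than a row-wise approach, since everything is done on the whole matrix at once.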
Here is one way we could do it:
We create a helper column to match the column names.
We set the value of a mean column to zero if its name matches the helper value.
Then we use transmute with select to calculate rowSums.
Finally we cbind column m1 to df:
library(dplyr)
df %>%
mutate(helper = paste0("mean_", group_name)) %>%
mutate(across(starts_with("mean"), ~ifelse(cur_column()==helper, 0, .))) %>%
transmute(m1 = select(., contains("mean")) %>%
rowSums()) %>%
cbind(df)
m1 group_name mean_A mean_B mean_C
1 5 A 1 2 3
2 6 B 2 3 4
3 7 C 3 4 5
From a data frame I need a list of all unique values of one column. For a possible later check, we need to keep the information from a second column, combined into a single string for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, but I cannot find the proper function or syntax.
E.g. something like
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ","))
or
df %>%
distinct(id) %>%
summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
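Your second approach doesn't work because you call distinct before you aggregate the data: distinct(id) keeps only the id column, so source is no longer available to summarise. You would need .keep_all = TRUE to also keep the other column.
Your first approach already returns one row per id, so the distinct at the end is only a safeguard.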
aggregate(source ~ id, df, toString)
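One detail worth checking before picking between the two aggregate calls: toString joins with a comma followed by a space, while paste with collapse = "," does not add the space. A small sketch on the sample data from the question (the object names `with_paste` and `with_toString` are only for illustration):

```r
df <- data.frame(id = c(1, 3, 1), source = c("x", "y", "z"))

# paste with an explicit separator: values joined as "x,z"
with_paste <- aggregate(source ~ id, df, paste, collapse = ",")

# toString uses ", " as its separator: values joined as "x, z"
with_toString <- aggregate(source ~ id, df, toString)

with_paste$source
with_toString$source
```

If the combined string is later split again, the separator you split on has to match the one used here.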
In R, when I run this group_by code, I obtain this result.
df <- tibble(y=c('a','a','a', 'b','b','b','b','b'), z=c(1,1,1,1,1,1,2,2))
df %>% group_by(z,y) %>% summarise(n())
z y n()
1 a 3
1 b 3
2 b 2
Is there a way to make it look like this?
z y n()
1 a 3
b 3
2 b 2
My goal is to have the formatting look the way it does in Pandas, where the multilevel index isn't repeated on each row.
Here's one possibility:
df <- tibble(y=c('a','a','a', 'b','b','b','b','b','a','b'), z=c(1,1,1,1,1,1,2,2,3,3))
df2 <-
df %>%
group_by(z,y) %>%
summarise(n = n()) %>%
group_by(z) %>%
mutate(z2 = if_else(row_number() == 1, as.character(z), " "), y, n) %>%
ungroup() %>%
transmute(z = z2, y, n)
df2 %>%
knitr::kable()
I'm having trouble thinking of ways to do this that don't involve grouping by the z column and finding the first row. Unfortunately that means you need to add a couple steps, because a grouping variable can't be modified in the mutate call.
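A possible shortcut, offered as a sketch rather than a drop-in replacement: because the counted rows come out sorted by z, duplicated() can flag every repeat of a z value after its first row, which avoids the second group_by. It relies on that sorting assumption and uses count() in place of group_by() plus summarise().

```r
library(dplyr)

df <- tibble(y = c('a','a','a','b','b','b','b','b','a','b'),
             z = c(1,1,1,1,1,1,2,2,3,3))

df2 <- df %>%
  count(z, y) %>%                      # one row per (z, y), sorted by z then y
  mutate(z = if_else(duplicated(z),    # TRUE for repeats of a z value
                     "",               # blank out the repeated label
                     as.character(z)))

df2
```

The result can be passed to knitr::kable() in the same way as before.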
I need to repeat an operation many times for different combinations of two variables (I am trying to create data for stacked barplots showing percentages). Could anyone turn the code below into a function (of the dataset and the two variables x and y) so that I can create the new data sets quickly? Or give me a good reference or link for learning about functions and dplyr. Thanks.
dat = df %>%
select(x, y) %>%
group_by(x, y) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n)*100, 1)) %>%
ungroup() %>%
group_by(x) %>%
mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
ungroup()
return(dat)
As suggested in the comments above, step-by-step explanations can be found here: dplyr.tidyverse.org/articles/programming.html
This guide explains the quo() function and the !! operator.
For your example you can create a function like so:
df1<- data.frame(x1 = c(rep(3,5), rep(7,2)),
y1 = c(rep(2,4), rep(5,3)))
my.summary <- function(df, x, y){
df %>%
select(!!x, !!y) %>%
group_by(!!x, !!y) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n)*100, 1)) %>%
ungroup() %>%
group_by(!!x) %>%
mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
ungroup()
}
my.summary(df1, quo(x1), quo(y1))
# # A tibble: 3 x 5
# x1 y1 n percentage pos
# <dbl> <dbl> <int> <dbl> <dbl>
# 1 3 2 4 80 40
# 2 3 5 1 20 90
# 3 7 5 2 100 50
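The same programming guide also describes the embrace operator `{{ }}` (available since rlang 0.4.0), which lets the caller pass bare column names without quo(). A sketch of that variant follows; the function name `my_summary` is made up here, and count() is used as a shorthand for the group_by() plus summarise(n = n()) steps.

```r
library(dplyr)

df1 <- data.frame(x1 = c(rep(3, 5), rep(7, 2)),
                  y1 = c(rep(2, 4), rep(5, 3)))

# Hypothetical helper: x and y are bare column names
my_summary <- function(df, x, y) {
  df %>%
    count({{ x }}, {{ y }}, name = "n") %>%   # rows per (x, y) combination
    group_by({{ x }}) %>%
    mutate(percentage = round(n / sum(n) * 100, 1),
           pos = cumsum(percentage) - 0.5 * percentage) %>%
    ungroup()
}

res <- my_summary(df1, x1, y1)
res
```

Called this way, the result should match the table produced by the quo() version above.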
I have two columns in a data.frame, that should have levels sorted in the same order, but I don't know how to do it in a straightforward manner.
Here's the situation:
library(ggplot2)
library(dplyr)
library(magrittr)
set.seed(1)
df1 <- data.frame(rating = sample(c("GOOD","BAD","AVERAGE"),10,T),
div = sample(c("A","B","C"),10,T),
n = sample(100,10,T))
# I'm adding a label column that I use for plotting purposes
df1 <- df1 %>% group_by(rating) %>% mutate(label = paste0(rating," (",sum(n),")")) %>% ungroup
# # A tibble: 10 x 4
# rating div n label
# <fctr> <fctr> <int> <chr>
# 1 BAD C 48 BAD (220)
# 2 BAD B 87 BAD (220)
# 3 BAD C 44 BAD (220)
# 4 GOOD B 25 GOOD (77)
# 5 AVERAGE B 8 AVERAGE (117)
# 6 AVERAGE C 10 AVERAGE (117)
# 7 AVERAGE A 32 AVERAGE (117)
# 8 GOOD B 52 GOOD (77)
# 9 AVERAGE C 67 AVERAGE (117)
# 10 BAD C 41 BAD (220)
# rating levels are sorted
df1$rating <- factor(df1$rating,c("BAD","AVERAGE","GOOD"))
ggplot(df1,aes(x=rating,y=n,fill=div)) + geom_col() # plots in the order I want
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col() # doesn't because levels aren't sorted
How do I copy the factor order from one column to another?
I can make it work this way but I think it's really awkward:
lvls <- df1 %>% select(rating,label) %>% unique %>% arrange(rating) %>% extract2("label")
df1$label <- factor(df1$label,lvls)
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col()
Instead of adding a label column and using aes(x = label), you can stick with aes(x = rating) and create the labels in scale_x_discrete:
ggplot(df1, aes(x = rating, y = n, fill = div)) +
geom_col() +
scale_x_discrete(labels = df1 %>%
group_by(rating) %>%
summarize(n = sum(n)) %>%
mutate(lab = paste0(rating, " (", n, ")")) %>%
pull(lab))
Once you have set the levels of rating, you can use forcats to set the levels of label by the order of rating like this...
library(forcats)
df1 <- df1 %>% group_by(rating) %>%
mutate(label=paste0(rating," (",sum(n),")")) %>%
ungroup %>%
arrange(rating) %>% #sort by rating
mutate(label=fct_inorder(label)) #set levels by order in which they appear
Or you can use forcats::fct_reorder to do the same thing...
df1$label <- fct_reorder(df1$label, as.numeric(df1$rating))
The plot then has the bars in the right order.
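If you would rather not depend on forcats, the same ordering can be sketched in base R: sort the labels by the rating factor and use their order of first appearance as the levels. The small data frame below is made up for illustration and is not the df1 from the question.

```r
# Hypothetical example data with rating levels already in the desired order
df1 <- data.frame(
  rating = factor(c("GOOD", "BAD", "AVERAGE", "BAD"),
                  levels = c("BAD", "AVERAGE", "GOOD")),
  n = c(10, 20, 30, 40)
)

# Label = rating plus the total n for that rating
df1$label <- paste0(df1$rating, " (", ave(df1$n, df1$rating, FUN = sum), ")")

# order() on a factor sorts by level position, so the labels come out
# in rating order; unique() keeps the first appearance of each label
df1$label <- factor(df1$label, levels = unique(df1$label[order(df1$rating)]))

levels(df1$label)
```

This is essentially what arrange() followed by fct_inorder() does in the forcats version.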