Aggregate the data in R - r

I have a data set that is shown below:
library(tidyverse)
data <- tribble(
~category, ~product_id,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
)
And now, I want to group it by the "category" variable, keep the "product_id" and add a new variable that counts the categories:
aggregated_data <- tribble(
~category, ~product_id, ~numberOfcategory
"A", 10, 5,
"B", 20, 3,
"C", 30, 2,
)
I already got the "numberOfcategory" with this code:
data %>%
group_by(category) %>%
tally(sort=TRUE)
But somehow I could not keep the product_id.
Could someone help me to get the dataframe (aggregated_data)? Thanks in advance.

You were close! Just also group by product_id as follows:
data %>%
group_by(category,product_id) %>%
tally(sort=TRUE)

Related

proportion within each factor using dplyr [duplicate]

This question already has answers here:
Relative frequencies / proportions with dplyr
(10 answers)
Closed 1 year ago.
I want to get the prop inside each factor using dplyr. The desired result appears in desired$prop
Thanks in advance :))
data <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa","uk",
"spain","usa","uk","spain","usa","uk","spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)
desired <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa",
"uk","spain","usa","uk","spain","usa","uk",
"spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25),
prop = c(0.285714286,0.181818182,0.142857143,0.357142857,
0.272727273,0.5,0.357142857,0.545454545,
0.357142857)
)
#MrFlick is right. And also faster than I am.
library(dplyr)
df <- data %>%
group_by(country) %>%
mutate(prop = value/sum(value))

Problem with 'mutate()' input 'data' in ANOVA (rstatix)

This is driving me crazy. I am using anova_test from rstatix and it's telling me that my columns aren't there when they clearly are.
This is what my dataframe looks like:
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
df <- data.frame(ID, Form, Pen, Time)
ID, Form, and Pen are factors, while Time is numeric. So each subject completed forms A and B with Red, Blue, and Green pens, and I measured how long each took in completing the form.
This is a fake dataset that I've purposefully come up with to ask this question. In reality, this dataframe is derived from a larger dataset with several more variables. Each variable has a lot more observations (so not just one datapoint for subject 1 & Form A & Red Pen, as in this example, but multiple), so I've summarized them to get mean Time.
df <- original.df %>% dplyr::select(ID, Form, Pen, Time)
df <- df %>% dplyr::group_by(ID, Form, Pen) %>% dplyr::summarise(Time = mean(Time))
df <- df %>% convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
I wanted to test the main and interaction effects, so I'm doing a 2 by 3 repeated measures ANOVA (a two-way ANOVA, because Form and Pen are two independent variables).
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
and I KEEP getting this error:
Error: Problem with `mutate()` input `data`.
x Can't subset columns that don't exist.
x Columns `ID` and `Form` don't exist.
ℹ Input `data` is `map(.data$data, .f, ...)`.
WHY?! Any help would be greatly appreciated. I've been searching solutions for HOURS and I'm getting pretty frustrated.
Thank you for adding the additional details to the post - based on what you've provided it looks like you need to ungroup your df before passing it to anova_test(), e.g.
#install.packages("rstatix")
library(rstatix)
library(tidyverse)
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
original.df <- data.frame(ID, Form, Pen, Time)
df <- original.df %>%
dplyr::select(ID, Form, Pen, Time)
df <- df %>%
dplyr::group_by(ID, Form, Pen) %>%
dplyr::summarise(Time = mean(Time))
df <- df %>%
convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
df <- ungroup(df)
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
You can see whether a dataframe is grouped using str(), e.g. str(df) before and after ungrouped() shows you the difference. Please let me know if you are still getting errors after making this change

Tidyverse: group_by, arrange, and lag across columns

I am working on a projection model for sports where I need to understand in a certain team's most recent game:
Who is their next opponent? (solved)
When is the last time their next opponent played?
reprex that can be used below. Using row 1 as an example, I would need to understand that "a"'s next opponent "e"'s most recent game was game_id_ 3.
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)
#Next opponent
df <- df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L))
If I can provide more details, please let me know.
We can use match to return the corresponding game_id_
library(dplyr)
df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L)) %>%
ungroup %>%
mutate(last_time = game_id_[match(next_opp, opp_)])

How to find the similarity in R?

I have a data set as I've shown below:
It shows which book is sold by which shop.
df <- tribble(
~shop, ~book_id,
"A", 1,
"B", 1,
"C", 2,
"D", 3,
"E", 3,
"A", 3,
"B", 4,
"C", 5,
"D", 1,
)
In the data set,
shop A sells 1, 3
shop B sells 1, 4
shop C sells 2, 5
shop D sells 3, 1
shop E sells only 3
So now, I want to calculate the Jaccard index here. For instance, let's take shop A and shop B. There are three different books that are sold by A and B (book 1, book 3, book 4). However, only one product is sold by both shops (this is product 1). So, the Jaccard index here should be 33.3% (1/3).
Here is the sample of the desired data:
df <- tribble(
~shop_1, ~shop_2, ~similarity,
"A", "B", 33.3,
"B", "A", 33.33,
"A", "C", 0,
"C", "A", 0,
"A", "D", 100,
"D", "A", 100,
"A", "E", 50,
"E", "A", 50,
)
Any comments/assistance really appreciated! Thanks in advance.
I don't know about a package but you can write your own function. I guess by similarity you mean something like this:
similarity <- function(x, y) {
k <- length(intersect(x, y))
n <- length(union(x, y))
k / n
}
Then you can use tidyr::crossing to merge the same data frame with itself
dfg <- df %>% group_by(shop) %>% summarise(books = list(book_id))
crossing(dfg %>% set_names(paste0, "_A"), dfg %>% set_names(paste0, "_B")) %>%
filter(shop_A != shop_B) %>%
mutate(similarity = map2_dbl(books_A, books_B, similarity))

Allocate ordinal values to numerical vector in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I have a set of data from children, recorded across a number of sessions. The number of sessions and age of each child in each session is different for each participant, so it looks something like this:
library(tibble)
mydf <- tribble(~subj, ~age,
"A", 16,
"A", 17,
"A", 19,
"B", 10,
"B", 11,
"B", 12,
"B", 13)
What I don't currently have in the data is a variable for Session number, and I'd like to add this to my dataframe. Basically I want to create a numeric variable that is ordinal from 1-n for each child, something like this:
mydf2 <- tribble(~subj, ~age, ~session,
"A", 16, 1,
"A", 17, 2,
"A", 19, 3,
"B", 10, 1,
"B", 11, 2,
"B", 12, 3
"B", 13, 4)
Ideally I'd like to do this in dplyr().
You simply need to group by subj and use row_number():
mydf %>%
group_by(subj) %>%
mutate(session = row_number())

Resources