Problem with 'mutate()' input 'data' in ANOVA (rstatix) - r

This is driving me crazy. I am using anova_test from rstatix and it's telling me that my columns aren't there when they clearly are.
This is what my dataframe looks like:
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
df <- data.frame(ID, Form, Pen, Time)
ID, Form, and Pen are factors, while Time is numeric. Each subject completed forms A and B with Red, Blue, and Green pens, and I measured how long each subject took to complete the form.
This is a fake dataset that I've purposefully come up with to ask this question. In reality, this dataframe is derived from a larger dataset with several more variables. Each variable has a lot more observations (so not just one datapoint for subject 1 & Form A & Red Pen, as in this example, but multiple), so I've summarized them to get mean Time.
df <- original.df %>% dplyr::select(ID, Form, Pen, Time)
df <- df %>% dplyr::group_by(ID, Form, Pen) %>% dplyr::summarise(Time = mean(Time))
df <- df %>% convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
I wanted to test the main and interaction effects, so I'm doing a 2 by 3 repeated measures ANOVA (a two-way ANOVA, because Form and Pen are two independent variables).
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
and I KEEP getting this error:
Error: Problem with `mutate()` input `data`.
x Can't subset columns that don't exist.
x Columns `ID` and `Form` don't exist.
ℹ Input `data` is `map(.data$data, .f, ...)`.
WHY?! Any help would be greatly appreciated. I've been searching solutions for HOURS and I'm getting pretty frustrated.

Thank you for adding the additional details to the post - based on what you've provided it looks like you need to ungroup your df before passing it to anova_test(), e.g.
#install.packages("rstatix")
library(rstatix)
library(tidyverse)
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
original.df <- data.frame(ID, Form, Pen, Time)
df <- original.df %>%
  dplyr::select(ID, Form, Pen, Time)
df <- df %>%
  dplyr::group_by(ID, Form, Pen) %>%
  dplyr::summarise(Time = mean(Time))
df <- df %>%
  convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
# summarise() drops only the last grouping variable, so df is still grouped by ID and Form
df <- ungroup(df)
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
You can check whether a data frame is grouped with str(), e.g. comparing str(df) before and after ungroup() shows you the difference. Please let me know if you are still getting errors after making this change.
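To see why the grouping survives the summarise() step, it may help to look at the grouping variables directly. The sketch below is not part of the original answer: it assumes dplyr 1.0 or later (for the `.groups` argument) and rebuilds the toy data from the question. `group_vars()` and `.groups = "drop"` are just one way to inspect and avoid the leftover grouping.

```r
library(dplyr)

# Toy data from the question, rebuilt with rep() for brevity
original.df <- data.frame(
  ID   = rep(1:3, each = 6),
  Form = rep(rep(c("A", "B"), each = 3), times = 3),
  Pen  = rep(c("Red", "Blue", "Green"), times = 6),
  Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
)

# summarise() peels off only the last grouping variable (Pen),
# so the result is still grouped by ID and Form
grouped <- original.df %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time))
group_vars(grouped)

# Dropping all groups inside summarise() makes a separate ungroup() unnecessary
flat <- original.df %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time), .groups = "drop")
group_vars(flat)
```

The leftover grouping variables are exactly the two columns named in the error message (`ID` and `Form`), which is consistent with the grouping being the cause.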

Related

R: Create every possible combinations of given columns

I have a data that corresponds to df. df shows the source and destination and the longitudes and latitudes of this sources and destinations.
I want to use df to generate df1. df1 gives all possible combinations of source and destination and while doing so combines the appropriate source and destination longitudes and latitudes.
Source <- c("A", "B", "C", "D")
Destination <- c("A", "B", "C", "D")
Source_Latitude <- c(1, 2, 3, 4)
Source_Longitude <- c(-1, -2, -3, -4)
Dest_Latitude <- c(1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4)
df <- data.frame(Source, Source_Latitude, Source_Longitude, Destination,Dest_Latitude,Dest_Longitude)
Source <- c("A", "A", "A", "A", "B","B","B","B", "C","C","C","C", "D","D","D","D")
Destination <- c("A", "B", "C", "D","A", "B", "C", "D","A", "B", "C", "D","A", "B", "C", "D")
Source_Latitude <- c(1,1,1,1, 2, 2, 2, 2, 3,3,3,3, 4,4,4,4)
Source_Longitude <- c(-1,-1,-1,-1,-2,-2,-2,-2,-3,-3,-3,-3,-4,-4,-4,-4)
Dest_Latitude <- c(1, 2, 3, 4,1, 2, 3, 4,1, 2, 3, 4,1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4,-1, -2, -3, -4,-1, -2, -3, -4,-1, -2, -3, -4)
df1 <- data.frame(Source, Source_Latitude, Source_Longitude, Destination,Dest_Latitude,Dest_Longitude)
I tried using crossing() and expand.grid() without any success
library(dplyr)
expand.grid(Source = df$Source, Destination = df$Destination) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  select(contains("Source"), contains("Dest")) %>% View()
As an additional observation, although the code works, I don't think it's best to keep sources and destinations in the same data frame, because the number of sources and destinations may differ. It would probably be better to have one data frame for each and adapt the code accordingly.
all_combination <- expand.grid(Source = df$Source, Destination = df$Destination) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  distinct()
This worked for me. It took a while to figure out how to use the expand.grid() function.
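The suggestion to keep sources and destinations in separate data frames could look roughly like this. It is a minimal sketch in base R, not taken from either answer: the `sources` and `destinations` tables and their values are hypothetical, and deliberately have different numbers of rows.

```r
# Hypothetical lookup tables; names and values are illustrative only
sources <- data.frame(
  Source = c("A", "B", "C", "D"),
  Source_Latitude = c(1, 2, 3, 4),
  Source_Longitude = c(-1, -2, -3, -4)
)
destinations <- data.frame(
  Destination = c("A", "B", "C"),
  Dest_Latitude = c(1, 2, 3),
  Dest_Longitude = c(-1, -2, -3)
)

# by = NULL asks merge() for the Cartesian product:
# every source row is paired with every destination row
all_pairs <- merge(sources, destinations, by = NULL)
nrow(all_pairs)  # 4 sources x 3 destinations = 12 rows
```

With this layout no join back to the coordinates is needed, because each row of the product already carries both sets of columns.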

Aggregate the data in R

I have a data set that is shown below:
library(tidyverse)
data <- tribble(
~category, ~product_id,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
)
And now, I want to group it by the "category" variable, keep the "product_id" and add a new variable that counts the categories:
aggregated_data <- tribble(
  ~category, ~product_id, ~numberOfcategory,
  "A", 10, 5,
  "B", 20, 3,
  "C", 30, 2,
)
I already got the "numberOfcategory" with this code:
data %>%
  group_by(category) %>%
  tally(sort = TRUE)
But somehow I could not keep the product_id.
Could someone help me to get the dataframe (aggregated_data)? Thanks in advance.
You were close! Just also group by product_id as follows:
data %>%
  group_by(category, product_id) %>%
  tally(sort = TRUE)
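If the count column should also carry the name from the desired output, dplyr's count() may be a slightly shorter route. This is a sketch rather than the answerer's code: it assumes each category maps to a single product_id, as in the example, and rebuilds the data as a plain data frame.

```r
library(dplyr)

# Same rows as the tribble in the question
data <- data.frame(
  category   = c("A", "B", "C", "A", "A", "B", "C", "A", "A", "B"),
  product_id = c(10, 20, 30, 10, 10, 20, 30, 10, 10, 20)
)

# count() groups, tallies, and ungroups in one step;
# `name` sets the name of the new count column
aggregated_data <- data %>%
  count(category, product_id, sort = TRUE, name = "numberOfcategory")
aggregated_data
```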

How to correctly add a transformed variable to ggplot axis

I would like to plot a transformed variable (in this case an average shift value) on the y axis. For the life of me I can't understand how to get R to plot the overall result (not just the calculated sum of each day's average). Any help would be greatly appreciated.
# set up
library(tidyverse)
# example data
df <-
tribble(
~Week, ~Day, ~Team, ~Sales, ~Shifts,
"WK1", 1, "A", 100, 1,
"WK1", 1, "B", 120, 1,
"WK1", 2, "A", 100, 1,
"WK1", 2, "B", 120, 1,
"WK1", 3, "A", 100, 1,
"WK1", 3, "B", 120, 1,
"WK1", 4, "A", 100, 1,
"WK1", 4, "B", 120, 1,
"WK1", 5, "A", 100, 1,
"WK1", 5, "B", 120, 1,
"WK1", 6, "A", 100, 1,
"WK1", 6, "B", 120, 1,
"WK1", 7, "A", 100, 1,
"WK1", 7, "B", 120, 1
)
# P1: y axis is not the shift average as desired. For example, Team A's shift average should be 100.
ggplot(df, aes(x = Week, y = (Sales/Shifts) )) +
geom_col() +
facet_grid(.~ Team)
# P2: ggplot seems to be calculating the sum of each individual day's shift average
ggplot(df, aes(x = Week, y = (Sales/Shifts), fill = Day )) +
geom_col() +
facet_grid(.~ Team)
The overall shift average should be
Team A: 100
Team B: 120
I'd recommend summarizing your data and giving ggplot the values you want to plot, rather than trying to use the graphics package to do the data manipulation for you.
df_avg = df %>%
  group_by(Team, Week) %>%
  summarize(Shift_Avg = mean(Sales / Shifts))
## or maybe you want sum(Sales) / sum(Shifts) ? Might be more appropriate
ggplot(df_avg, aes(x = Week, y = Shift_Avg)) +
  geom_col() +
  facet_grid(~ Team)
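The two candidate definitions mentioned in the comment above agree on the example data, because every row has exactly one shift, but they can differ once shift counts vary. A small illustration with made-up numbers (the `sales` and `shifts` vectors are hypothetical, not from the question):

```r
# Hypothetical data: two days with different numbers of shifts
sales  <- c(100, 300)
shifts <- c(1, 2)

# Average of the daily per-shift values: (100 + 150) / 2 = 125
mean(sales / shifts)

# Overall sales per shift: 400 / 3, which weights each shift equally
sum(sales) / sum(shifts)
```

Which one is appropriate depends on whether each day or each shift should count equally in the average.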

Plot observation number (label) in outlier points

I have this boxplot with outliers. I need to label each outlier with the row number of the observation, to make it easy to go into the data set and find the value. Can somebody help me?
set.seed(1)
a <- runif(10,1,100)
b <-c("A","A","A","A","A","B","B","B","B","B")
t <- cbind(a,b)
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1] + 2, labels = round(bp$stats[,1], 2))
text(x = 2, y = bp$stats[,2] + 2, labels = round(bp$stats[,2], 2))
What is the point of t <- cbind(a, b)? That makes a character matrix and converts your numbers to character strings, and you don't use it anyway. If you want a single data structure, use data.frame(a, b), which keeps a numeric and stores b as a factor (or as a character column in R 4.0 and later). I do not get the same plot as you with set.seed(1), so I'll provide slightly different data. Note the use of the pos= and offset= arguments in text(). Be sure to read the manual page to see what they are doing:
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
98.52, 51.26)
b <- c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B")
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1], labels = round(bp$stats[, 1], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
text(x = 2, y = bp$stats[, 2], labels = round(bp$stats[, 2], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
obs <- which(a %in% bp$out)
text(bp$group, bp$out, obs, pos=4)
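One caveat with which(a %in% bp$out) is that it returns row numbers in data order, while bp$out is ordered by group, so the two can fall out of step when the groups are interleaved in the data. A possible variant, sketched here under the assumption that outlier values are not repeated in the data, looks each outlier up with match() instead. The `dat` data frame is introduced only for this illustration.

```r
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
       59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
       98.52, 51.26)
b <- rep(c("A", "B"), each = 10)
dat <- data.frame(a, b)

bp <- boxplot(a ~ b, data = dat)

# For each outlier value in bp$out, match() returns the first row holding it,
# so obs stays aligned element by element with bp$group and bp$out
obs <- match(bp$out, dat$a)
text(bp$group, bp$out, labels = obs, pos = 4)
```

If the same value can occur in several rows, match() only reports the first one, so a per-group lookup would be needed in that case.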

Using matplot in R whenever certain column changes

Sorry in advance because I am new at asking questions here and don't know how to input this table properly.
Say I have a data frame in R constructed like:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind(team, value)
I want to create a plot that will give me 3 lines graphing the values for teams A, B, and C. I believe I can do this inputting the matrix m into matplot somehow, but I'm not sure how.
EDIT: I've gotten a lot closer to solving my problem. However I've realized that for some reason, with the code I have, "Value" is a list of 745 which matches the number of rows in my dataframe m. However when I unlist(Value) it turns into a numeric of length 894. Any ideas on why this would happen?
You can try something like this:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind.data.frame(team, value)
library(ggplot2)
ggplot(m, aes(x=as.factor(1:nrow(m)), y=value, group=team, col=team)) +
geom_line(lwd=2) + xlab('index')
If you have the same number of ordered values for each team, you could use matplot to visualize them, but the data should be converted to a matrix first:
m = cbind.data.frame(team, value, index = rep(1:3, 3))
m <- reshape(m, v.names = 'value', idvar = 'team', direction = 'wide', timevar = 'index')
matplot(t(m[, 2:4]), type = 'l', lty = 1)
legend('top', legend = m[, 1], lty = 1, col = 1:3)
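For reference, the same matrix can also be built in base R without reshape(). This is only a sketch, assuming (as above) that every team has the same number of values in the same order. It also shows why the original cbind(team, value) is awkward to plot: the result is a character matrix.

```r
team  <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# cbind() on a character and a numeric vector yields a character matrix
is.character(cbind(team, value))

# split() gives one numeric vector per team; cbind() turns them into
# a numeric matrix with one column per team, which is what matplot() expects
wide <- do.call(cbind, split(value, team))
matplot(wide, type = "l", lty = 1, col = 1:3, xlab = "index", ylab = "value")
legend("top", legend = colnames(wide), lty = 1, col = 1:3)
```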
