R - ggplot two boxplots for separate columns

I am trying to draw two boxplots on one graph, one for each of two separate variables.
My dataset has many variables, but I only want to compare two: income_husband and income_wife.
I have done it using boxplot(), but how can I do it using ggplot?

It would help if you had provided some data to work with, but I have put some sample data together. Group c is filtered out. Is this roughly what you are after?
library(tidyverse)
group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", 'c', "c")
income = c(100, 120, 110, 23, 34, 120, 45, 156, 65, 52, 65, 98)
data <- tibble(group, income)
data
data2 <- data %>%
  filter(group == "a" | group == "b")
b <- ggplot(data2, aes(x = group, y = income))
b + geom_boxplot()
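If the two incomes are stored as separate columns, as in the question, one possible approach is to reshape them into long format first. The sketch below assumes a data frame with columns income_husband and income_wife; the values, the extra children column, and the names incomes, incomes_long, and spouse are made up for illustration.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in data; only the two income column names come from the question
incomes <- data.frame(
  income_husband = c(100, 120, 110, 23, 34, 120),
  income_wife    = c(45, 156, 65, 52, 65, 98),
  children       = c(0, 2, 1, 3, 1, 2)  # an extra variable that is not plotted
)

# Stack the two income columns into one "income" column plus a "spouse" label
incomes_long <- pivot_longer(
  incomes,
  cols      = c(income_husband, income_wife),
  names_to  = "spouse",
  values_to = "income"
)

# One box per original column
p <- ggplot(incomes_long, aes(x = spouse, y = income)) +
  geom_boxplot()
p
```

After reshaping, each original column becomes one level of spouse, so geom_boxplot() draws one box per column.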

Related

Transferring name of column to a function in R

I'm trying to write a function which returns specific details about outliers (only sex, age, education, and the outlying value). I need to do it for many columns, so I would like to pass the name of the column to the function. Is there a way to do it?
For example, this code should return: f, 27, 12, 110.
my_data= data.frame( sex= c("f", "m", "f", "f", "m"),
age= c(22, 30, 24, 27, 30),
eduyears= c(12,16, 15, 12, 17),
weight= c(53, 70, 60, 110, 75),
height= c(160, 183, 157, 168, 180))
find_outliers= function (my_data, colname) {
out_values= boxplot.stats(my_data$colname)$out
out_ind= which(my_data$colname %in% out_values) #find outliers indices
outliers= my_data[out_ind ,c("sex","age","eduyears", colname)]
return (outliers)
}
find_outliers(weight)
The function has two arguments, so you need to pass both of them in its call; you are only passing one, weight. And because weight is passed as an unquoted name, the function must first convert it to a character string in order to access the column.
Finally, see the famous question on how to Dynamically select data frame columns using $ and a vector of column names.
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
age = c(22, 30, 24, 27, 30),
eduyears = c(12,16, 15, 12, 17),
weight = c(53, 70, 60, 110, 75),
height = c(160, 183, 157, 168, 180))
find_outliers <- function (my_data, colname) {
  # get the colname as a character string
  colname <- as.character(substitute(colname))
  out_values <- boxplot.stats(my_data[[colname]])$out
  out_ind <- which(my_data[[colname]] %in% out_values) #find outliers indices
  outliers <- my_data[out_ind, c("sex","age","eduyears", colname)]
  outliers
}
find_outliers(my_data, weight)
#> sex age eduyears weight
#> 4 f 27 12 110
my_data |> find_outliers(weight)
#> sex age eduyears weight
#> 4 f 27 12 110
Created on 2022-11-05 with reprex v2.0.2
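If you would rather pass the column name as a quoted string (for example when looping over several columns), a variant along the following lines might be simpler. This is only a sketch; find_outliers_chr is a hypothetical name, not part of the answer above.

```r
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
                      age = c(22, 30, 24, 27, 30),
                      eduyears = c(12, 16, 15, 12, 17),
                      weight = c(53, 70, 60, 110, 75),
                      height = c(160, 183, 157, 168, 180))

# Hypothetical variant: colname is a character string, so no substitute() is needed
find_outliers_chr <- function(df, colname) {
  stopifnot(is.character(colname), colname %in% names(df))
  out_values <- boxplot.stats(df[[colname]])$out
  # keep the rows whose value in that column is an outlier
  df[df[[colname]] %in% out_values, c("sex", "age", "eduyears", colname)]
}

res <- find_outliers_chr(my_data, "weight")
res

# The same function can then be applied to several columns by name
for (col in c("weight", "height")) {
  print(find_outliers_chr(my_data, col))
}
```

The key point is the same as above: `[[` accepts a string held in a variable, whereas `$` looks for a column literally called colname.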

Recoding several columns at once

I'm recoding values to letters with the following line of code (which worked) :
df_mean$COMMUNITY_mean <- cut(df_mean$COMMUNITY_mean, breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
In order to apply it to multiple columns :
names <- colnames(df_mean) #extract columns names to a list
names <- names[-c(1:10)]; #remove 10 first columns not interested in
for(i in 1:length(names)) {
df_mean <- cut(names[[i]], breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
}
But it fails to execute "Error in cut.default(names[[i]], breaks = c(0, 10, 25, 50, 75, 90, Inf), :
'x' must be numerical"
Any suggestions ?
Try replacing the first line with this, which extracts the names of the numeric columns from your data:
names <- colnames(df_mean[as.logical(lapply(df_mean, is.numeric))])
# remove this line ===> names <- colnames(df_mean)
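Note that the loop in the question also passes the column name (a string) to cut() and overwrites the whole data frame with the result, which is what triggers the error. A minimal sketch of a loop that recodes each column in place could look like this; the small df_mean below, including the SCHOOL_mean column, is invented for illustration.

```r
# Hypothetical stand-in for df_mean: one ID column followed by numeric columns
df_mean <- data.frame(
  id = c("x", "y", "z"),
  COMMUNITY_mean = c(5, 30, 95),
  SCHOOL_mean = c(12, 60, 80)
)

breaks <- c(0, 10, 25, 50, 75, 90, Inf)
labels <- c("a", "b", "c", "d", "e", "f")

# Names of the numeric columns to recode
num_cols <- names(df_mean)[sapply(df_mean, is.numeric)]

# Recode each column's values (not its name) and assign back into that column
for (col in num_cols) {
  df_mean[[col]] <- cut(df_mean[[col]], breaks = breaks, labels = labels)
}
df_mean
```

If only a subset of the numeric columns should be recoded, num_cols can be filtered further before the loop.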

proportion within each factor using dplyr [duplicate]

This question already has answers here:
Relative frequencies / proportions with dplyr
I want to get the proportion within each factor using dplyr. The desired result appears in desired$prop.
Thanks in advance :))
data <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa","uk",
"spain","usa","uk","spain","usa","uk","spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)
desired <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa",
"uk","spain","usa","uk","spain","usa","uk",
"spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25),
prop = c(0.285714286,0.181818182,0.142857143,0.357142857,
0.272727273,0.5,0.357142857,0.545454545,
0.357142857)
)
@MrFlick is right. And also faster than I am.
library(dplyr)
df <- data %>%
  group_by(country) %>%
  mutate(prop = value/sum(value))

Problem with 'mutate()' input 'data' in ANOVA (rstatix)

This is driving me crazy. I am using anova_test from rstatix and it's telling me that my columns aren't there when they clearly are.
This is what my dataframe looks like:
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
df <- data.frame(ID, Form, Pen, Time)
ID, Form, and Pen are factors, while Time is numeric. So each subject completed forms A and B with Red, Blue, and Green pens, and I measured how long each took in completing the form.
This is a fake dataset that I've purposefully come up with to ask this question. In reality, this dataframe is derived from a larger dataset with several more variables. Each variable has a lot more observations (so not just one datapoint for subject 1 & Form A & Red Pen, as in this example, but multiple), so I've summarized them to get mean Time.
df <- original.df %>% dplyr::select(ID, Form, Pen, Time)
df <- df %>% dplyr::group_by(ID, Form, Pen) %>% dplyr::summarise(Time = mean(Time))
df <- df %>% convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
I wanted to test the main and interaction effects, so I'm doing a 2 by 3 repeated measures ANOVA (a two-way ANOVA, because Form and Pen are two independent variables).
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
and I KEEP getting this error:
Error: Problem with `mutate()` input `data`.
x Can't subset columns that don't exist.
x Columns `ID` and `Form` don't exist.
ℹ Input `data` is `map(.data$data, .f, ...)`.
WHY?! Any help would be greatly appreciated. I've been searching solutions for HOURS and I'm getting pretty frustrated.
Thank you for adding the additional details to the post - based on what you've provided it looks like you need to ungroup your df before passing it to anova_test(), e.g.
#install.packages("rstatix")
library(rstatix)
library(tidyverse)
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
original.df <- data.frame(ID, Form, Pen, Time)
df <- original.df %>%
  dplyr::select(ID, Form, Pen, Time)
df <- df %>%
  dplyr::group_by(ID, Form, Pen) %>%
  dplyr::summarise(Time = mean(Time))
df <- df %>%
  convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
df <- ungroup(df)
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
You can see whether a dataframe is grouped using str(), e.g. running str(df) before and after ungroup() shows you the difference. Please let me know if you are still getting errors after making this change.
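As a side note, the leftover grouping comes from summarise(), which by default drops only the last grouping variable. A small sketch, using made-up data and the hypothetical names raw, grouped, and dropped, shows how to check for grouping with dplyr::is_grouped_df() and how .groups = "drop" avoids the separate ungroup() step:

```r
library(dplyr)

# Hypothetical raw data with two observations per ID/Form cell
raw <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Form = c("A", "A", "B", "B", "A", "A", "B", "B"),
  Time = c(20, 4, 6, 2, 86, 35, 74, 94)
)

# Default behaviour: the result is still grouped by ID
grouped <- raw %>%
  group_by(ID, Form) %>%
  summarise(Time = mean(Time))

# .groups = "drop" returns an ungrouped tibble
dropped <- raw %>%
  group_by(ID, Form) %>%
  summarise(Time = mean(Time), .groups = "drop")

is_grouped_df(grouped)  # TRUE
group_vars(grouped)     # "ID"
is_grouped_df(dropped)  # FALSE
```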

Aggregate the data in R

I have a data set that is shown below:
library(tidyverse)
data <- tribble(
~category, ~product_id,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
)
And now, I want to group it by the "category" variable, keep the "product_id" and add a new variable that counts the categories:
aggregated_data <- tribble(
~category, ~product_id, ~numberOfcategory,
"A", 10, 5,
"B", 20, 3,
"C", 30, 2,
)
I already got the "numberOfcategory" with this code:
data %>%
group_by(category) %>%
tally(sort=TRUE)
But somehow I could not keep the product_id.
Could someone help me to get the dataframe (aggregated_data)? Thanks in advance.
You were close! Just also group by product_id as follows:
data %>%
  group_by(category, product_id) %>%
  tally(sort = TRUE)
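If you also want the count column to be called numberOfcategory as in your desired output, dplyr::count() combines the grouping and tallying and lets you name the result. A minimal sketch using the data from the question:

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~category, ~product_id,
  "A", 10,
  "B", 20,
  "C", 30,
  "A", 10,
  "A", 10,
  "B", 20,
  "C", 30,
  "A", 10,
  "A", 10,
  "B", 20
)

# count() groups by the listed columns, tallies the rows, and names the count column
aggregated_data <- data %>%
  count(category, product_id, name = "numberOfcategory", sort = TRUE)
aggregated_data
```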
