Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I would like to find the mean wage for males and for females. How do I compute the mean of the wages in the female column and of the wages in the male column?
Please always ask your question in a reproducible fashion.
See here: https://stackoverflow.com/help/minimal-reproducible-example
Regarding your question: I'm using the mtcars dataset as an example. Assuming you have all female values in one column and all male values in another (like mpg and cyl here), you could use the tidyverse:
library(tidyverse)
data(mtcars)
mtcars %>%
summarise(across(.cols = c(mpg,cyl),.fns = mean))
or for your question, assuming that your dataset is called df and your columns are called female and male:
df %>%
summarise(across(.cols = c(female,male),.fns = mean))
If, however, your data is organised differently, with gender in one column and the value you want to average in another column called value, then you could do:
mtcars %>%
group_by(vs) %>%
summarise(resulting_mean_mpg = mean(mpg))
Where we have calculated the mean mpg by vs.
In your case this might be
df %>%
group_by(gender) %>%
summarise(resulting_mean_value = mean(value))
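As a concrete illustration of the grouped approach, here is a minimal sketch on made-up data. The data frame wages and its columns gender and wage are assumptions for the example, not the asker's actual data:

```r
library(dplyr)

# Hypothetical toy data in "long" layout: one row per person
wages <- data.frame(
  gender = c("female", "female", "female", "male", "male"),
  wage   = c(10, 20, 30, 20, 40)
)

# One mean per gender; na.rm = TRUE skips missing wages if there are any
wage_means <- wages %>%
  group_by(gender) %>%
  summarise(mean_wage = mean(wage, na.rm = TRUE))

wage_means
```

With these numbers the female mean is 20 and the male mean is 30.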
Hope this helps!
We can use colMeans in base R after selecting the columns of interest
colMeans(df1[c('male', 'female')], na.rm = TRUE)
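To see what na.rm = TRUE changes, here is a small sketch on invented data (the values in df1 are hypothetical, chosen so that each column contains one missing wage):

```r
# Hypothetical toy data with one missing wage in each column
df1 <- data.frame(
  male   = c(10, 20, NA),
  female = c(15, NA, 25)
)

# Without na.rm, a single NA makes the whole column mean NA
colMeans(df1[c("male", "female")])

# With na.rm = TRUE, the mean is taken over the non-missing values only
col_means <- colMeans(df1[c("male", "female")], na.rm = TRUE)
col_means
```

Here the second call gives 15 for male and 20 for female, while the first call returns NA for both columns.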
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Good Morning Everyone,
I have a small issue regarding a dataframe:
I have 165 different countries, some with more than 30 occurrences. What I would like to do is take only 30 occurrences for each country, and then apply the mean function to the related variables.
Do you have any idea how I can achieve this?
Here is the dataframe :
Thanks for your answer,
Rémi
Assuming you want to sample 30 rows from each group, we can do the following. Unfortunately, dplyr's sample_n fails when a group has fewer rows than you want to sample (unless you sample with replacement).
Where df is your data.frame:
Solution 1:
library(dplyr)
df %>% group_by(Nationality) %>%
sample_n(30, replace=TRUE) %>%
distinct() %>% # to remove repeated rows where nationalities have fewer than 30 rows
summarise_at(vars(Age, Overall, Passing), mean) # pass mean directly; funs() is deprecated in current dplyr
Solution 2:
df %>% split(.$Nationality) %>%
lapply(function(x) {
if (nrow(x) < 30)
return(x)
x %>% sample_n(30, replace=FALSE)
}) %>%
do.call(what=bind_rows) %>%
group_by(Nationality) %>%
summarise_at(vars(Age, Overall, Passing), mean) # pass mean directly; funs() is deprecated in current dplyr
Naturally without guarantee as you did not supply a working example.
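For readers on a newer dplyr, a shorter variant may be possible. The sketch below assumes dplyr 1.0.0 or later, where the documentation for slice_sample() states that n is silently truncated to the group size, so small groups are kept whole. The toy data (countries "A" and "B", a single Age column) is invented for the example:

```r
library(dplyr)

set.seed(1)  # only to make the random sample repeatable

# Hypothetical toy data: country "A" has 50 rows, country "B" only 10
df <- data.frame(
  Nationality = rep(c("A", "B"), times = c(50, 10)),
  Age         = sample(18:40, 60, replace = TRUE)
)

sampled <- df %>%
  group_by(Nationality) %>%
  slice_sample(n = 30) %>%   # at most 30 rows per country, without replacement
  ungroup()

# Rows kept per country
counts <- count(sampled, Nationality)
counts

# Mean of the sampled rows per country
sampled %>%
  group_by(Nationality) %>%
  summarise(mean_age = mean(Age))
```

Under that assumption, country "A" contributes 30 rows and country "B" all 10 of its rows.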
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
I am using boxplot to show the distribution among 5 different data sets.
I know it is possible to arrange them based on their median values.
What I am looking for is to arrange them based on the difference between the first quartile and the third.
Obviously I do not want to arrange them manually by reordering the levels.
I have fixed this using tidyverse group_by and summarise: I calculate the difference between the desired quartiles and use that to arrange the boxes.
If anyone needs the code or has a better solution, please let me know.
Thank you.
Do you mean the Interquartile range (IQR())? If so you can do
diamonds %>%
as_tibble() %>%
ggplot(aes(reorder(cut, price, IQR), price)) +
geom_boxplot()
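To check what reorder() does with IQR, here is a small base R sketch. The data frame toy and its three groups are made up so that the spreads are easy to verify by hand:

```r
# Hypothetical toy data: three groups with clearly different spreads
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  value = c(10, 11, 12, 13, 14,   # a: IQR = 2
            0, 10, 20, 30, 40,    # b: IQR = 20
            5, 10, 15, 20, 25)    # c: IQR = 10
)

# reorder() sorts the factor levels by FUN applied to value within each group
toy$group <- reorder(toy$group, toy$value, FUN = IQR)
levels(toy$group)

# Base R boxplot now draws the boxes from narrowest to widest IQR
boxplot(value ~ group, data = toy)
```

The levels come out as "a", "c", "b", that is, in increasing order of IQR. For decreasing order, one option is to pass FUN = function(x) -IQR(x).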
Here is how I ordered my boxplots by the difference between the 1st and 3rd quartiles. "df" is your data.frame, "column1" is the column to group by, and "column2" contains the values whose distribution you want to see.
DisTable <- df %>%
group_by(column1) %>%
summarise(Min=quantile(column2,probs=0.0),
Q1=quantile(column2, probs=0.25),
Median=quantile(column2, probs=0.5),
Q3=quantile(column2, probs=0.75),
Max=quantile(column2,probs=1),
DiffQ3Q1=Q3-Q1) %>%
arrange(desc(DiffQ3Q1))
bporder <- as.character(DisTable$column1)
ggplot(df,aes(x=factor(column1,levels=bporder),y=column2,fill=column1))+
geom_boxplot()
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a dataframe that has indicators as values of a column
X Y Ind
1 10000 N
2 10000 N
3 10000 G
4 10000 L
I want to create a bar graph using ggplot that shows the total count and total Y for each indicator value, side by side.
I am trying to figure out how to do this aggregation without first summarizing the dataframe and creating a Count value for each category of Ind.
Updated: This
One option would be to get count (n()) and sum of 'Y' after grouping by 'Ind', gather (from tidyr to reshape it to 'long' format) and get the barplot with geom_bar (from ggplot2).
library(dplyr)
library(tidyr)
library(ggplot2)
df1 %>%
group_by(Ind) %>%
summarise(Count=n(), TotalY = sum(Y)) %>%
gather(Var, Val, -Ind) %>%
ggplot(., aes(x=Ind, y = Val, fill=Var)) +
geom_bar(stat="identity", position="dodge")
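To make the intermediate table visible, here is a sketch of the same summarise-then-reshape steps on the four rows from the question, stopping before the plot. It uses pivot_longer, which tidyr (1.0.0 and later) offers as the replacement for gather; the names summary_wide and summary_long are just labels for this example:

```r
library(dplyr)
library(tidyr)

# The four rows shown in the question
df1 <- data.frame(
  X   = 1:4,
  Y   = c(10000, 10000, 10000, 10000),
  Ind = c("N", "N", "G", "L")
)

# One row per indicator: number of rows and total of Y
summary_wide <- df1 %>%
  group_by(Ind) %>%
  summarise(Count = n(), TotalY = sum(Y))

# Long format, one row per (Ind, Var) pair, ready for a dodged bar plot
summary_long <- summary_wide %>%
  pivot_longer(cols = -Ind, names_to = "Var", values_to = "Val")

summary_long
```

For indicator N this gives a Count of 2 and a TotalY of 20000. Since the two measures are on very different scales, the Count bars will be barely visible next to TotalY; faceting by Var with free y scales may be a more readable alternative.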
This question already has answers here:
Mean per group in a data.frame [duplicate]
(8 answers)
Closed 7 years ago.
I am trying to calculate an overall mean across multiple classes. Currently the database is in long format. I tried selecting first the ID number (grouping variable 1), then the classes I am interested in via a dummy variable (stem=1; grouping variable 2), and then calculating one GPA mean (i.e., the stem GPA mean) from the grades received in those classes.
I have attached an example of the database below. Overall, I am trying to figure out how to calculate the stem GPA for each student.
See example here
I have tried using library(psych), describeBy(data, dataset$id, dataset$stem), but to no avail. Any suggestions?
I prefer the dplyr package for these operations. Try e.g.
df %>% group_by(class) %>% summarise(mean_grade = mean(grade)) # "grade" stands in for your numeric column
For instance, using the mtcars dataset:
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(mean_disp = mean(disp))
will give you all the means of disp based on the grouping variable cyl.
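Applied to the stem GPA question, the same pattern might look like the sketch below. The column names id and stem come from the question; the data frame grades, the column grade, and all values are assumptions, since the original data was not posted:

```r
library(dplyr)

# Hypothetical long-format data: one row per student and class
grades <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  stem  = c(1, 1, 0, 1, 0),
  grade = c(4.0, 3.0, 2.0, 3.0, 4.0)
)

# Keep only STEM classes, then average the grades within each student
stem_gpa <- grades %>%
  filter(stem == 1) %>%
  group_by(id) %>%
  summarise(stem_gpa = mean(grade))

stem_gpa
```

Student 1 gets a stem GPA of 3.5 (the mean of 4.0 and 3.0) and student 2 gets 3.0; the non-STEM rows do not enter the means.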
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
From the following data, I want to extract the mean for males and females separately. How do I achieve this in R?
Or you can use dplyr:
library(dplyr)
df <- data.frame(variable=c(rep('Males', 10), rep('Females', 10)), value=sample(1:1000, 20))
df$variable <- as.factor(df$variable)
df2 <- df %>% group_by(variable) %>% summarise(average = mean(value))
df2
Source: local data frame [2 x 2]
variable average
1 Females 566.8
2 Males 575.0
You could use which to identify the rows for male/female. Here some example data:
df <- data.frame(variable=c(rep('Males', 10), rep('Females', 10)), value=sample(1:1000, 20))
and then
mean(df[which(df$variable=='Males'),]$value)
mean(df[which(df$variable=='Females'),]$value)
Also have a look at aggregate:
aggregate(.~variable, data=df, mean)