I have a tibble with many observations and variables.
What I want to do is simply calculate, grouping by variable1 and variable2, the mean of variableXXX and the total number of missing values for each group.
This is what I have written so far:
data%>%
group_by(variable1,variable2)%>%
summarise(mean(variableXXX))
How can I calculate the number of missing values for each group? I am new to R, so the simplest solution is best.
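We can take the sum of the logical vector created with is.na (each TRUE counts as 1). Adding na.rm = TRUE to mean keeps the missing values from turning the group mean into NA.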
library(dplyr)
data %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),
            MissingCount = sum(is.na(variableXXX)))
NOTE: This assumes we are interested in the count of NAs in the 'variableXXX' column, grouped by 'variable1' and 'variable2'.
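To see what this returns, here is a minimal sketch on a made-up tibble (the name `toy` and its values are hypothetical, not from the question). Note that a group whose values are all NA gets a mean of NaN once na.rm = TRUE removes everything.

```r
library(dplyr)

# Hypothetical toy data standing in for `data`
toy <- tibble(
  variable1   = c("a", "a", "a", "b", "b"),
  variable2   = c("x", "x", "y", "x", "x"),
  variableXXX = c(1, NA, 3, NA, NA)
)

res <- toy %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),         # mean of the non-missing values
            MissingCount = sum(is.na(variableXXX)),         # TRUE counts as 1
            .groups = "drop")                               # return an ungrouped tibble

res
```

The `.groups = "drop"` argument is optional; it only silences the regrouping message and returns an ungrouped result.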
If we instead need the NA count across all columns of each group's subset of the dataset
library(purrr)
data %>%
  group_split(variable1, variable2) %>%
  map_dfr(~ .x %>%
            summarise(Mean = mean(variableXXX, na.rm = TRUE),
                      MissingCount = sum(is.na(.))))
Related
I have a dataset that contains a logical variable ('verdad') and a group variable ('group') that splits the data into several groups. Now I would like to summarize the data and calculate the mean of the logical variable, to test the hypothesis that the occurrence of TRUE and FALSE values in the 'verdad' column differs across the groups. The code is as simple as this:
domy_nad_1000 %>%
filter(usable_area > 1000) %>%
group_by(group) %>%
mean(verdad, na.rm = TRUE)
The datatype of 'verdad' is logical but it is showing this error:
In mean.default(., verdad, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Is there a way to fix it?
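You simply need to wrap your mean in a summarize function. As written, the pipe hands the whole grouped data frame to mean as its first argument, which is why mean.default complains that the argument is not numeric or logical.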
domy_nad_1000 %>%
filter(usable_area > 1000) %>%
group_by(group) %>%
summarize(verdad_mean = mean(verdad, na.rm = TRUE))
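A small sketch on made-up data showing both calls side by side (the tibble `houses` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a logical column and a grouping column
houses <- tibble(
  group  = c("g1", "g1", "g2", "g2"),
  verdad = c(TRUE, FALSE, TRUE, NA)
)

# Piping straight into mean() passes the data frame itself as `x`,
# so mean.default warns and returns NA
bad <- suppressWarnings(
  houses %>% group_by(group) %>% mean(verdad, na.rm = TRUE)
)

# Inside summarize(), `verdad` refers to the column within each group
good <- houses %>%
  group_by(group) %>%
  summarize(verdad_mean = mean(verdad, na.rm = TRUE))

good
```

The mean of a logical vector is the proportion of TRUE values, so `verdad_mean` can be read directly as the share of TRUE in each group.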
I'm new to R and I have a 22x252 data set; the 252 rows have many repeated values in column 1 (ID). I made another data set with one row per unique ID (with those IDs already filled in), and I want to populate the rest of its columns from the original data set, basically by summing all the values that share the same value in column 1.
Is there a basic function that lets me do this?
Thanks & Regards
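We can use aggregate in base R. Assuming the first column is named 'ID' and all other columns are numeric, we group by 'ID' and get the sum of the rest of the columns. Note that the formula method drops every row containing an NA by default (na.action = na.omit), so we pass na.action = na.pass and let na.rm = TRUE handle the missing values inside sum
aggregate(. ~ ID, df1, sum, na.rm = TRUE, na.action = na.pass)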
Or with dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise_at(vars(-group_cols()), sum, na.rm = TRUE)
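Or, in dplyr 1.0.0 and later, with across (grouping columns are excluded automatically, so everything() selects only the other columns)
df1 %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))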
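As a quick check, here is a sketch on a made-up data frame (the values are hypothetical) comparing the base R and dplyr routes. It also shows what aggregate does when na.action is left at its default.

```r
library(dplyr)

# Hypothetical toy data: repeated IDs, numeric columns with some NAs
df1 <- data.frame(ID = c("a", "a", "b"),
                  v1 = c(1, 2, NA),
                  v2 = c(10, NA, 30))

# dplyr: sum every non-grouping column within each ID
dplyr_sum <- df1 %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

# base R: keep rows with NAs and let sum() drop them via na.rm
base_sum <- aggregate(. ~ ID, df1, sum, na.rm = TRUE, na.action = na.pass)

# base R with the default na.action = na.omit: rows 2 and 3 are
# removed before summing, so ID "b" disappears entirely
base_default <- aggregate(. ~ ID, df1, sum, na.rm = TRUE)
```

With these values the first two results agree (3 and 10 for "a", 0 and 30 for "b"), while the default call keeps only the one complete row.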
Title is self-explanatory. Looking to calculate percent NA by ID group in R. There are lots of posts on calculating NA by variable column but almost nothing on doing it by row groups.
If there are multiple columns, after grouping by 'ID', use summarise_at to loop over the columns, create a logical vector with is.na, get the mean, and multiply by 100
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise_at(vars(-group_cols()), ~ 100 *mean(is.na(.)))
If we want a single percentage per 'ID', pooled across all the other columns, reshape to long format first
library(tidyr)
df1 %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(Perc = 100 * mean(is.na(value)))
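To make the difference between the per-column and the pooled percentage concrete, here is a minimal sketch on made-up data (the data frame and its values are hypothetical; across is used here in place of summarise_at, which assumes dplyr 1.0.0 or later):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two IDs, two measurement columns
df1 <- data.frame(ID = c(1, 1, 2, 2),
                  x  = c(NA, 1, 2, 3),
                  y  = c(NA, NA, 5, NA))

# Percent NA per column within each ID
per_col <- df1 %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ 100 * mean(is.na(.x))))

# Percent NA per ID with all other columns pooled together
overall <- df1 %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  summarise(Perc = 100 * mean(is.na(value)))
```

For ID 1 the per-column result is 50 for x and 100 for y, and the pooled result is 75 (3 missing out of 4 cells).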
Or with aggregate from base R
aggregate(.~ ID, df1, FUN = function(x) 100 * mean(is.na(x)), na.action = na.pass)
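Or, in base R, to get the proportion across the other columns: unlist them, create a table from the 'ID' column (repeated once per column) and the is.na logical vector, and use prop.table with margin = 1 to get the row-wise proportions (multiply by 100 for percentages)
prop.table(table(ID = rep(df1$ID, times = ncol(df1) - 1),
                 value = is.na(unlist(df1[setdiff(names(df1), "ID")]))),
           margin = 1)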
So, I am using the data.frame 'studentdata' and I have sorted the males from the females. I also created a new column called HoursSlept. Now I must find the standard deviation for males in the column HoursSlept and the same for Females.
Can someone help me??
Here is what I did but I don't know what sd it is giving me.
If you want to do it your way, you'd have to write:
sd(Males$HoursSlept, na.rm = TRUE)
sd(Females$HoursSlept, na.rm = TRUE)
This is because Males and Females are data.frames, and sd needs a single column from each of them. A more elegant way would be not to split the data into two data.frames. Instead you could use dplyr's filter function.
library(dplyr)
studentdata %>%
filter(Gender == "male") %>%
  summarise(sd = sd(HoursSlept, na.rm = TRUE))
And the same for the females. Or, as @MrGumble suggested, both at once with group_by:
studentdata %>%
  group_by(Gender) %>%
  summarise(sd = sd(HoursSlept, na.rm = TRUE))
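A minimal sketch of the group_by version on made-up data, to show that it gives one standard deviation per gender (the values below are hypothetical, and the column names Gender and HoursSlept are assumed to match the question's data):

```r
library(dplyr)

# Hypothetical stand-in for `studentdata`
studentdata <- data.frame(
  Gender     = c("male", "male", "male", "female", "female", "female"),
  HoursSlept = c(6, 8, NA, 7, 7, 9)
)

sd_by_gender <- studentdata %>%
  group_by(Gender) %>%
  summarise(sd = sd(HoursSlept, na.rm = TRUE))  # NA rows are ignored

sd_by_gender
```

The result has one row per level of Gender, so there is no ambiguity about which group each value belongs to.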
library(Hmisc)
attach(studentdata)
summarize(HoursSlept,gender,sd)
I would like, when summarizing after grouping, to count the number of a specific level of another factor.
In the working example below, I would like to count the number of "male" levels in each group. I've tried many things with count, tally and so on but cannot find a straightforward and neat way to do it.
df <- data.frame(Group=replicate(20, sample(c("A","B"), 1)),
Value=rnorm(20),
Factor=replicate(20, sample(c("male","female"), 1)))
df %>%
group_by(Group) %>%
summarize(Value = mean(Value),
n_male = ???)
Thanks for your help!
We can use sum on a logical vector, i.e. Factor == "male". The TRUE/FALSE values are coerced to 1/0, so the sum gives the number of 'male' elements in each group
df %>%
  group_by(Group) %>%
  summarise(Value = mean(Value),
            n_male = sum(Factor == "male"))
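Since the question's data is random, here is the same idea on a small fixed data frame (the values are hypothetical) where the counts can be checked by eye:

```r
library(dplyr)

# Hypothetical fixed data instead of the random sample in the question
df <- data.frame(Group  = c("A", "A", "B", "B", "B"),
                 Value  = c(1, 2, 3, 4, 5),
                 Factor = c("male", "female", "male", "male", "female"))

counts <- df %>%
  group_by(Group) %>%
  summarise(Value  = mean(Value),
            n_male = sum(Factor == "male"))  # TRUE is counted as 1

counts
```

Group A has one "male" row and group B has two. If Factor could contain NA, `sum(Factor == "male", na.rm = TRUE)` would keep the count from becoming NA.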