R: Summarize and calculate the mean of a logical variable

I have a dataset that contains a logical variable ('verdad') and a group variable ('group') that splits the data into several groups. I would like to summarize the data and calculate the mean of the logical variable, to test the hypothesis that the occurrence of TRUE and FALSE values in the 'verdad' column differs across the groups. The code is as simple as this:
domy_nad_1000 %>%
  filter(usable_area > 1000) %>%
  group_by(group) %>%
  mean(verdad, na.rm = TRUE)
The datatype of 'verdad' is logical but it is showing this error:
In mean.default(., verdad, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Is there a way to fix it?

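Wrap mean() in summarize(). Piped directly, mean() receives the whole grouped data frame as its first argument, and a data frame is neither numeric nor logical, which is what the warning reports.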
domy_nad_1000 %>%
  filter(usable_area > 1000) %>%
  group_by(group) %>%
  summarize(verdad_mean = mean(verdad, na.rm = TRUE))
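As a minimal sketch of the same pattern, here it is on a small made-up table. The `houses` tibble and its values are assumptions standing in for domy_nad_1000, which is not shown in the question. The mean of a logical vector is the share of TRUE values, so each group gets its proportion of TRUE.

```r
library(dplyr)

# Hypothetical stand-in for domy_nad_1000 (values invented for illustration)
houses <- tibble(
  group       = c("a", "a", "a", "b", "b", "b"),
  usable_area = c(1200, 1500, 900, 1100, 2000, 1300),
  verdad      = c(TRUE, FALSE, TRUE, TRUE, NA, TRUE)
)

result <- houses %>%
  filter(usable_area > 1000) %>%                       # drops the 900 row
  group_by(group) %>%
  summarize(verdad_mean = mean(verdad, na.rm = TRUE))  # share of TRUE per group

print(result)
# group "a": 0.5 (one TRUE, one FALSE); group "b": 1 (NA is ignored)
```

The result has one row per group, which is usually the shape needed before comparing proportions between groups.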

Related

Is there a limit of factors in `dplyr::group_by`?

I'm struggling to calculate the wear of a component using the lag of a variable. I need to calculate the wear within different groups, so I'm using group_by. The problem is that when I group by the variable I need, the result is a column of NAs, but when I test by grouping on another variable that has fewer factors, the calculation works.
The dataframe I'm using has 4093902 rows and 52 columns. The variable I need to group by has 90183 factors. The other one that I tested, which worked, had 11321 factors.
Here's the code I'm using:
final_date = result_data %>%
  arrange((time)) %>%
  group_by(id_specific) %>%
  mutate(wear = dplyr::lag(some_value, n = 1, default = NA) - some_value)
Does anyone know if there is a factor limit for grouping? Or any other tips on how I can perform this calculation?
The NA values can come from two places: lag returns NA for the first row of each group by default, and some_value itself may contain NA. Any arithmetic (including -) with an NA on either side returns NA. One option is to use a function that accepts na.rm = TRUE, such as rowSums:
library(dplyr)
final_date <- result_data %>%
  arrange((time)) %>%
  group_by(id_specific) %>%
  mutate(some_value_new = dplyr::lag(some_value, n = 1,
                                     default = NA)) %>%
  ungroup %>%
  mutate(wear = rowSums(cbind(some_value_new, -1 * some_value),
                        na.rm = TRUE), some_value_new = NULL)
NOTE: It is better to ungroup before calling rowSums. This is more efficient because rowSums then runs once on the whole data frame rather than once per group.
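To see where the NAs come from, the following sketch runs both calculations on a tiny invented table. The `readings` tibble, its values, and the helper column names `previous`, `wear_diff` and `wear_rowsums` are assumptions for illustration; it also sorts by id_specific and time so the row order is unambiguous. Every group's first row has no previous value, so a group with a single row yields only NA with plain subtraction. With rowSums and na.rm = TRUE, those rows become -some_value instead, which may or may not be what is wanted.

```r
library(dplyr)

# Hypothetical toy data standing in for result_data
readings <- tibble(
  id_specific = c("x", "x", "x", "y"),
  time        = c(1, 2, 3, 1),
  some_value  = c(10, 8, 5, 7)
)

with_wear <- readings %>%
  arrange(id_specific, time) %>%
  group_by(id_specific) %>%
  mutate(previous = dplyr::lag(some_value)) %>%   # NA on each group's first row
  ungroup() %>%
  mutate(
    # plain subtraction: NA wherever there is no previous value
    wear_diff    = previous - some_value,
    # rowSums variant: NA is dropped, so first rows become -some_value
    wear_rowsums = rowSums(cbind(previous, -1 * some_value), na.rm = TRUE)
  )

print(with_wear)
```

Group "y" has only one row, so its wear_diff is NA. If most of the grouping values in the real data occur only once, a column that is mostly NA would be the expected result rather than a sign of a grouping limit.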

Is there a way to calculate proportions by groups?

I'm trying to calculate the following proportion for each city: mean(age < 25).
My code so far is the following:
namevar <- data %>% group_by(city) %>% mean (age < 25).
My data is clean and has no NA.
If I use mean(age <25) it works, but when I use the group_by function it doesn't.
This is the message that appears:
In mean.default(unlist(x, use.names = FALSE, recursive = TRUE), :
argument is not numeric or logical: returning NA
Thanks a lot for reading and helping :)
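We can use mutate (to add the proportion as a new column) or summarise (to return one row per city). The mean has to be computed inside one of these verbs so that age is looked up in the grouped data: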
library(dplyr)
data1 <- data %>%
  group_by(city) %>%
  summarise(Prop = mean(age < 25))
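The difference between the two verbs can be sketched on a small made-up table. The `people` tibble, the city names and the ages are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: five people in two cities
people <- tibble(
  city = c("Lima", "Lima", "Lima", "Quito", "Quito"),
  age  = c(20, 30, 22, 40, 24)
)

# summarise: one row per city
by_city <- people %>%
  group_by(city) %>%
  summarise(Prop = mean(age < 25))

# mutate: keeps all rows and repeats the city's proportion on each one
with_prop <- people %>%
  group_by(city) %>%
  mutate(Prop = mean(age < 25)) %>%
  ungroup()

print(by_city)    # Lima: 2/3, Quito: 1/2
print(with_prop)  # still five rows
```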

Calculating each factor level's sd() for a variable

My coauthor asked me to add the sd for the factor variables that have more than two levels, but sd(as.numeric(df$factor)) gives me a single value instead of one sd per level. I imagine purrr::map could handle it, but df %>% select(factor) %>% as.numeric %>% map(~(sd(.))) fails with: Error in function_list[[i]](value) : 'list' object cannot be coerced to type 'double', even though df is not a list.
If you want the sd for each level of the factor column, use that column as a grouping variable:
library(dplyr)
df %>%
  group_by(factor) %>%
  summarise(SD = sd(anothercolumn, na.rm = TRUE))
Based on the description, if instead we need the sd of each factor variable that has more than two levels, computed on its integer codes:
df %>%
  summarise(across(where(~ is.factor(.) && nlevels(.) > 2),
                   ~ sd(as.numeric(.))))
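A small sketch of the two readings, on invented data. The `scores` tibble and its column names `level` and `score` are assumptions, not the asker's data. Note that as.numeric() on a factor returns the integer level codes, so the second calculation is the spread of those codes, not of any measured variable.

```r
library(dplyr)

# Hypothetical data: a three-level factor and a numeric measurement
scores <- tibble(
  level = factor(c("a", "a", "b", "b", "c", "c")),
  score = c(1, 3, 2, 2, 5, 9)
)

# Reading 1: sd of the measurement within each factor level
sd_by_level <- scores %>%
  group_by(level) %>%
  summarise(SD = sd(score, na.rm = TRUE))

# Reading 2: sd of the factor's integer codes (1, 1, 2, 2, 3, 3)
sd_of_codes <- sd(as.numeric(scores$level))

print(sd_by_level)
print(sd_of_codes)
```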

How to calculate mean of data frame in R?

I have a data.frame "nitrates". And I have to calculate the mean of the values.
When I use:
mean(nitrates)
it gives me NA with the warning:
Warning message:
In mean.default(nitrates) : argument is not numeric or logical: returning NA
I want to calculate the mean of data. How can I do that?
Let's say you have a data frame containing a mix of string and numeric columns. Since mean is only defined for numeric (or logical) values, you first need to select the numeric columns and then average them. I don't have your data frame, so the example below uses another one, storms; replace storms with nitrates.
library(dplyr)
data('storms')
# mean for each column
storms %>% select(where(is.numeric)) %>% apply(2, mean, na.rm = TRUE)
# mean for each row
storms %>% select(where(is.numeric)) %>% apply(1, mean, na.rm = TRUE)
# mean over all elements
storms %>% select(where(is.numeric)) %>% as.matrix() %>% mean(na.rm = TRUE)
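The same idea can be sketched in base R without dplyr. The `nitrates` data frame below is invented for illustration (the real one is not shown in the question), and the column names `site`, `june` and `july` are assumptions.

```r
# Hypothetical stand-in for the asker's nitrates data frame
nitrates <- data.frame(
  site = c("A", "B", "C"),
  june = c(1.0, 2.0, NA),
  july = c(3.0, 4.0, 5.0)
)

# Keep only the numeric columns; mean() cannot average a character column
numeric_cols <- nitrates[sapply(nitrates, is.numeric)]

# Mean of each column, ignoring missing values
col_means <- colMeans(numeric_cols, na.rm = TRUE)

# Single mean over every numeric cell
overall_mean <- mean(as.matrix(numeric_cols), na.rm = TRUE)

print(col_means)     # june 1.5, july 4
print(overall_mean)  # 3
```

colMeans does the same job as apply(2, mean, ...) for this case and is typically faster on large data frames.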

Count the number of missing values in groups in R

I have a tibble with many observations and variables.
What I want to do is simply calculate, grouping by variable1 and variable2, the mean of variableXXX and the total number of missing values for each group.
This is what I have written so far:
data %>%
  group_by(variable1, variable2) %>%
  summarise(mean(variableXXX))
How can I calculate the number of missing values for each group? I am new to R, so the simplest solution is best.
is.na returns a logical vector, and summing it counts the TRUE values, which gives the number of NAs:
library(dplyr)
data %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),
            MissingCount = sum(is.na(variableXXX)))
NOTE: This assumes we are interested in the count of NAs in the 'variableXXX' column, grouped by 'variable1' and 'variable2'.
If we need the NA count across all columns of each group's subset of the data:
library(purrr)
data %>%
  group_split(variable1, variable2) %>%
  map_dfr(~ .x %>%
            summarise(Mean = mean(variableXXX, na.rm = TRUE),
                      MissingCount = sum(is.na(.))))
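A minimal sketch of the per-column count on invented data follows. The `measurements` tibble and its values are assumptions for illustration, and `.groups = "drop"` is added only to return an ungrouped result without the grouping message.

```r
library(dplyr)

# Hypothetical data: two groups, one missing value in the first
measurements <- tibble(
  variable1   = c("a", "a", "a", "b", "b"),
  variable2   = c("x", "x", "x", "y", "y"),
  variableXXX = c(1, NA, 3, 4, 6)
)

summary_tbl <- measurements %>%
  group_by(variable1, variable2) %>%
  summarise(
    Mean         = mean(variableXXX, na.rm = TRUE),  # NA rows are ignored
    MissingCount = sum(is.na(variableXXX)),          # TRUE counts as 1
    .groups      = "drop"
  )

print(summary_tbl)
# a/x: Mean 2, MissingCount 1
# b/y: Mean 5, MissingCount 0
```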
