How to calculate the mean of a data frame in R?

I have a data.frame called "nitrates" and I need to calculate the mean of its values.
When I use:
mean(nitrates)
it gives me NA with the warning:
Warning message:
In mean.default(nitrates) : argument is not numeric or logical: returning NA
I want to calculate the mean of the data. How can I do that?

Say you have a data frame containing a mix of string and numeric columns. Since mean() is only defined for numeric values, you first need to select the numeric columns and then average them. I don't have your data frame, so the example below uses another one (storms), but you can replace storms with nitrates.
library(dplyr)
data(storms)

# mean of each column
storms %>% select_if(is.numeric) %>% apply(2, mean, na.rm = TRUE)

# mean of each row
storms %>% select_if(is.numeric) %>% apply(1, mean, na.rm = TRUE)

# mean over all elements
storms %>% select_if(is.numeric) %>% as.matrix() %>% mean(na.rm = TRUE)
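If you prefer to stay entirely in dplyr for the per-column means, here is a sketch of the same idea (requires dplyr >= 1.0.0 for across() and where()):
# per-column means without apply()
storms %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))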

Related

R Summarize and calculate mean for logical variable

I have a dataset that contains a logical variable ('verdad') and a group variable ('group') that splits the data into several groups. I would like to summarize the data and calculate the mean of the logical variable, to test the hypothesis that the occurrence of TRUE and FALSE values in the 'verdad' column differs across the groups. The code is as simple as this:
domy_nad_1000 %>%
  filter(usable_area > 1000) %>%
  group_by(group) %>%
  mean(verdad, na.rm = TRUE)
The datatype of 'verdad' is logical, but this produces the following warning:
In mean.default(., verdad, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Is there a way to fix it?
You simply need to wrap your mean in a summarize function.
domy_nad_1000 %>%
  filter(usable_area > 1000) %>%
  group_by(group) %>%
  summarize(verdad_mean = mean(verdad, na.rm = TRUE))
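Note that mean() of a logical vector is the proportion of TRUE values, so verdad_mean is the per-group share of TRUE. A quick illustration with made-up data:
library(dplyr)
# hypothetical toy data: two groups, one logical column
tibble(group = c("a", "a", "b", "b"), verdad = c(TRUE, FALSE, TRUE, TRUE)) %>%
  group_by(group) %>%
  summarize(verdad_mean = mean(verdad, na.rm = TRUE))
# group "a": 0.5, group "b": 1.0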

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
You probably mean summarise instead of mutate; with mutate you would just replicate the group-level result on every row.
mutate_at and summarise_at are superseded, and you should use across instead.
The reason your code wasn't working is that you did not write your function as a formula (you did not add ~ at the beginning), and you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population, which at that point is grouped like the rest of the data frame. When you use df$Population, you are pulling the column from the original, ungrouped data frame. Not only is that wrong, it also throws an error, because the length of the variable you are averaging and the length of the weights supplied by df$Population do not match.
Here is how you could do it:
library(dplyr)

df %>%
  group_by(cz) %>%
  summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
            .groups = "drop")
If you really need to use summarise_at (probably because you are on a dplyr version older than 1.0.0), then you could do:
df %>%
  group_by(cz) %>%
  summarise_at(vlist, ~ weighted.mean(., Population)) %>%
  ungroup()
For reference, I defined df and vlist as follows:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
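As a quick sanity check (using the simulated df above), the grouped result for one cz should match a direct call to weighted.mean():
# hypothetical check for group "a"
df %>%
  filter(cz == "a") %>%
  summarise(check = weighted.mean(Public_Welf_Total_Exp, Population))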

Impute Missing Values for Numerical and Categorical, and center and scale the categorical values with exceptions in R

I would like to impute the median for numerical missing values and the mode for categorical missing values, and then convert all the categorical values into dummies and center and scale them.
However, I do not want to convert the customer IDs, nor to center and scale them.
Could you help me to fix my code?
library(recipes)
train.recipe <- recipe(y ~ ., data = trainingdata) %>%
  step_medianimpute(all_numeric()) %>%
  step_modeimpute(all_nominal()) %>%
  step_dummy(all_nominal(), -all_outcomes(), -trainingdata$Customer_ID) %>%
  step_center(all_predictors(), -trainingdata$Customer_ID) %>%
  step_scale(all_predictors(), -trainingdata$Customer_ID)
train.recipe %>%
  prep() %>%
  bake(., data.clean) %>%
  glimpse()
Without knowing your data frame, and assuming the customer ID is the only variable you do not want to transform, you can simply move the IDs into the row names before transforming:
library(tibble)

df %>%
  column_to_rownames("id")  # move the id column into the row names
df %>%
  rownames_to_column("id")  # revert: move the row names back into a column
For this to work, the customer IDs need to be unique!
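A quick round trip with a toy data frame (hypothetical id and x columns) shows the idea:
library(tibble)

toy <- data.frame(id = c("c1", "c2", "c3"), x = c(10, 20, 30))
toy_rn <- column_to_rownames(toy, "id")       # ids become row names; only x stays a column
toy_back <- rownames_to_column(toy_rn, "id")  # ids restored as a regular column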

How do I write these using pipes in R?

How do I get subgroups by using pipes? I don't understand why what I wrote doesn't work. Can someone explain how these work? Reading up and looking at examples online hasn't helped me, because I am not sure what I am misunderstanding.
mean(mtcars$qsec)

mtcars %>%
  select(qsec) %>%
  mean()
Warning message:
In mean.default(.) : argument is not numeric or logical: returning NA

mean(mtcars$qsec[mtcars$cyl == 8])

mtcars %>%
  group_by(qsec) %>%
  filter(cyl == 8)
mean()
Error in mean.default() : argument "x" is missing, with no default

mean(mtcars$mpg[mtcars$hp > median(mtcars$hp)])

mtcars %>%
  group_by(mpg) %>%
  filter(hp > median(hp))
mean
The reason is that select still returns a data.frame (with one column), while mean expects a vector; from ?mean:
x - An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.
We can use pull to extract the column as a vector and then apply mean on it:
library(dplyr)

mtcars %>%
  pull(qsec) %>%
  mean()
#[1] 17.84875
In the second case, we want the mean of 'qsec' where 'cyl' is 8:
mtcars %>%
  select(qsec, cyl) %>%
  filter(cyl == 8) %>%
  pull(qsec) %>%
  mean()
#[1] 16.77214
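The third case follows the same pattern: filter on the condition, then pull the column (no group_by is needed for a single overall mean):
mtcars %>%
  filter(hp > median(hp)) %>%
  pull(mpg) %>%
  mean()
#[1] 15.40667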

Normalize specified columns in dplyr by value in first row

I have a data frame with four rows, 23 numeric columns and one text column. I'm trying to normalize all the numeric columns by subtracting the value in the first row.
I've tried mutate_at, but I couldn't figure out a good way to make it work.
I got it to work by converting to a matrix and converting back to a tibble:
## First, did some preprocessing to get out the group I want
totalNKFoldChange <- filter(signalingFrame, Population == "Total NK") %>%
  ungroup()

totalNKFoldChange_mat <- select(totalNKFoldChange, signalingCols) %>%
  as.matrix()

normedNKFoldChange <- sweep(totalNKFoldChange_mat, 2, totalNKFoldChange_mat[1, ])

normedNKFoldChange %<>%
  cbind(Timepoint = levels(totalNKFoldChange$Timepoint)) %>%
  as.tibble() %>%
  mutate(Timepoint = factor(Timepoint, levels = levels(totalNKFoldChange$Timepoint)))
I'm so certain there's a nicer way to do it that would be fully dplyr native. Anyone have tips? Thank you!!
If we want to normalize all the numeric columns by subtracting the value in the first row, use mutate_if
library(dplyr)

df1 %>%
  mutate_if(is.numeric, list(~ . - first(.)))
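With dplyr 1.0.0 or later, the same thing can be written with across() (a sketch under the same assumption that df1 holds your data):
df1 %>%
  mutate(across(where(is.numeric), ~ . - first(.)))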
