I have data with Income, Spending, Population, and State. Income, Spending, and Population have missing values.
I wrote a for loop that replaces the missing values with the median calculated state-wise, but I have to run it separately for Income, Spending, and Population. I tried to create a function so I could pass just the column name, but it gives me an error with is.na(). Here is the for loop:
for (i in (unique(data$State))) {
  data$Income[is.na(data$Income) & data$State==i] <-
    median(data$Income[data$State==i], na.rm = TRUE)
}
In place of Income I tried making a function and passing x, but it is not working. Can someone help me write this function? I tried a few things, but they gave me an error with is.na():
Med_sub <- function(x){
  for (i in (unique(data$State))) {
    data$x[is.na(data$x) & data$State==i] <- median(data$x[data$State==i], na.rm = TRUE)
  }
}
Med_sub(Income)
Med_sub(Population)
I am new to R. Any help would be greatly appreciated.
Consider a base R two-liner with ave (which applies a function within groups defined by factors and returns a vector the same length as its input) and ifelse, all wrapped in an sapply loop over the columns:
median_fill <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
data[c("Income", "Spending", "Population")] <- sapply(data[c("Income", "Spending", "Population")],
                                                      function(i) ave(i, data$State, FUN = median_fill))
A tidyverse version (mutate_all() and funs() are superseded, so this uses across()):
library(dplyr)
data %>%
  group_by(State) %>%
  mutate(across(everything(), ~ coalesce(.x, median(.x, na.rm = TRUE)))) %>%
  ungroup()
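If you'd rather keep the loop-based function from the question, the underlying problem is that data$x looks for a column literally named "x"; passing the column name as a string and indexing with [[ fixes it. A minimal sketch (the function must also return the modified data frame, since R does not modify arguments in place):
# Minimal sketch: take the column name as a string and index with [[,
# because data$x always looks for a column literally called "x".
Med_sub <- function(data, col) {
  for (i in unique(data$State)) {
    idx <- data$State == i
    data[[col]][is.na(data[[col]]) & idx] <- median(data[[col]][idx], na.rm = TRUE)
  }
  data  # return the modified copy; the caller must reassign it
}
data <- Med_sub(data, "Income")
data <- Med_sub(data, "Population")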
In the code below, I'm trying to find the mean correct score for each item in the "category" column of the "regular season" dataset I'm working with.
rs_category <- list2env(split(regular_season, regular_season$category),
.GlobalEnv)
unique_categories <- unique(regular_season$category)
for (i in unique_categories)
Mean_[i] <- mean(regular_season$correct[regular_season$category == i], na.rm = TRUE, .groups = 'drop')
eapply(rs_category, Mean_[i])
print(i)
I'm having trouble getting this to work, though. I have created a list of sub-datasets, one per category, and separately (I think) a vector of the unique category values to run the for loop over. I suspect the problem is how I defined the mean function, because an error occurs at the "eapply()" line saying "Mean_[i]" is not a function, but I can't think of how else to define it. If someone could help, I would greatly appreciate it.
The issue is that Mean_ is never initialized, and Mean_[i] is not a function, which is what eapply() expects. In the code below, we initialize the object 'Mean_' as a numeric vector with the same length as 'unique_categories', then loop over the sequence of 'unique_categories', take the subset of 'correct' for each category, apply mean, and store the result as the ith value of 'Mean_':
Mean_ <- numeric(length(unique_categories))
for (i in seq_along(unique_categories)) {
  Mean_[i] <- mean(regular_season$correct[regular_season$category == unique_categories[i]],
                   na.rm = TRUE)
}
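If you do want to keep the environment of sub-datasets from the question, note that eapply() expects a function as its second argument; a minimal sketch along those lines:
# Minimal sketch: eapply() takes a function, so pass one that computes the
# mean of 'correct' for each sub-data.frame stored in the environment.
means_by_cat <- eapply(rs_category, function(d) mean(d$correct, na.rm = TRUE))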
If we need faster execution, use data.table:
library(data.table)
setDT(regular_season)[, .(Mean_ = mean(correct, na.rm = TRUE)), by = category]
Or using collapse:
library(collapse)
fmean(slt(regular_season, correct), g = regular_season$category)
Instead of splitting the dataset and using a for loop, you can use R's grouping functions, which apply a function to each unique group (value).
library(dplyr)
regular_season %>%
group_by(category) %>%
summarise(Mean_ = mean(correct, na.rm = TRUE)) -> result
This gives you the average value of correct for each category; result$Mean_ is the vector you are looking for.
In base R, this can be solved with aggregate.
result <- aggregate(correct~category, regular_season, mean, na.rm = TRUE)
I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate the result on every row.
mutate_at and summarise_at are superseded and you should use across instead.
The reason your code wasn't working is that you did not write your function as a formula (no ~ at the beginning), and you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population which, at that point, is grouped like the rest of the dataframe. When you use df$Population you are calling the column of the original, ungrouped dataframe. Not only is that the wrong set of weights, you also get an error because the length of the variable you are averaging and the length of the weights supplied by df$Population do not match.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
  summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
            .groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
For testing, I used df and vlist defined as follows:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
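For reference, since the question also mentioned lapply: grouping in base R can be done by splitting on cz first. A hedged sketch using the example data above:
# Split by cz, then take the weighted mean of every column in vlist
# within each group, and rbind the per-group results into a matrix.
wm_by_cz <- lapply(split(df, df$cz), function(d)
  sapply(d[vlist], weighted.mean, w = d$Population))
wm_by_cz <- do.call(rbind, wm_by_cz)  # one row per commuting zone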
Good Morning,
I have a lot of data to work with: 25 columns (variables), each containing thousands of values, but also missing values.
I calculated the column means with
colMeans(df, na.rm = TRUE)
How can I calculate the sd of each column while ignoring the NA values?
You can try,
apply(df, 2, sd, na.rm = TRUE)
Note that apply() first coerces the data frame to a matrix, so with mixed column types everything is converted to a common type; a more direct and safer option is to use lapply or sapply, as noted by @docendodiscimus:
sapply(df, sd, na.rm = TRUE)
If we convert to a matrix, colSds from matrixStats can be used:
library(matrixStats)
colSds(as.matrix(df), na.rm=TRUE)
Or we can use summarise_each from dplyr
library(dplyr)
df %>%
summarise_each(funs(sd(., na.rm=TRUE)))
As summarise_each() has been deprecated, here is an up-to-date example using dplyr:
df %>% summarise(across(everything(), ~ sd(.x, na.rm = TRUE)))
sd(variablename, na.rm = TRUE)
This works for me. Replace "variablename" with the column you want.
I have a dataframe in the form of:
df:
RepName, Discount
Bob,Smith , 5383.24
Johh,Doe , 30349.21
...
The names are repeated. In the df, RepName is a factor and Discount is numeric. I want to calculate the mean per RepName. I can't seem to get the aggregate statement right.
I've tried:
#This doesn't work
repAggDiscount <- aggregate(repdf, by = repdf$RepName, FUN = mean)
#Not what I want:
repAggDiscount <- aggregate(repdf, by = list(repdf$RepName), FUN = mean)
I've also tried the following:
repnames <- lapply(repdf$RepName, toString)
repAggDiscount <- aggregate(repdf, by = repnames, FUN = mean)
But that gives me a length mismatch...
I've read the help but an example of how this should work for my data would go a long way... thanks!
I'm posting #AnandaMahto's answer here to close out the question. You can use the formula syntax
aggregate(Discount ~ RepName, repdf, mean)
Or you can use the by= syntax
repAggDiscount <- aggregate(repdf$Discount, by = list(repdf$RepName), FUN = mean)
The problem with your syntax was that you were trying to aggregate the whole data.frame, which includes the RepName column, where taking the mean doesn't make sense.
You could also do
repAggDiscount <- aggregate(repdf[, -1], by = repdf[, 1, drop = FALSE], FUN = mean)
which is closer to the matrix style syntax.
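For comparison, the same grouped mean with dplyr (assuming the same repdf) would be:
library(dplyr)
repAggDiscount <- repdf %>%
  group_by(RepName) %>%
  summarise(Discount = mean(Discount))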
Suppose, you have a data.frame like this:
x <- data.frame(v1=1:20,v2=1:20,v3=1:20,v4=letters[1:20])
How would you select only those columns in x that are numeric?
EDIT: updated to avoid use of ill-advised sapply.
Since a data frame is a list we can use the list-apply functions:
nums <- unlist(lapply(x, is.numeric), use.names = FALSE)
Then standard subsetting
x[ , nums]
## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)
For a more idiomatic modern R I'd now recommend
x[ , purrr::map_lgl(x, is.numeric)]
Less code, fewer of R's particular quirks, more straightforward, and robust to use on database-backed tibbles:
dplyr::select_if(x, is.numeric)
Newer versions of dplyr also support the following syntax:
x %>% dplyr::select(where(is.numeric))
The dplyr package's select_if() function is an elegant solution:
library("dplyr")
select_if(x, is.numeric)
Filter() from the base package is the perfect function for that use case. You simply have to write:
Filter(is.numeric, x)
It is also much faster than select_if():
library(microbenchmark)
microbenchmark(
dplyr::select_if(mtcars, is.numeric),
Filter(is.numeric, mtcars)
)
returns (on my computer) a median of 60 microseconds for Filter, and 21 000 microseconds for select_if (350x faster).
In case you are interested only in the column names, use this:
names(dplyr::select_if(x, is.numeric))
iris %>% dplyr::select(where(is.numeric)) #as per most recent updates
Another option with purrr would be to negate discard function:
iris %>% purrr::discard(~!is.numeric(.))
If you want the names of the numeric columns, you can add names or colnames:
iris %>% purrr::discard(~!is.numeric(.)) %>% names
This is an alternative to the other answers:
x[, sapply(x, class) == "numeric"]
Note that this keeps only columns whose class is exactly "numeric", so integer columns (class "integer") are dropped even though is.numeric() would include them.
With data.table:
x[, sapply(x, is.numeric), with = FALSE]
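With a recent data.table (assuming a version where .SDcols accepts a filter function, 1.13.0 or later), the same selection can also be written as:
library(data.table)
setDT(x)                        # convert the data.frame to a data.table by reference
x[, .SD, .SDcols = is.numeric]  # keep only the numeric columns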
library(purrr)
x <- x %>% keep(is.numeric)
The PCAmixdata package has a function splitmix() that splits the quantitative (numerical) and qualitative (categorical) columns of a given dataframe "YourDataframe", as shown below:
install.packages("PCAmixdata")
library(PCAmixdata)
split <- splitmix(YourDataframe)
X1 <- split$X.quanti  # the numerical columns of the dataset
X2 <- split$X.quali   # the categorical columns of the dataset
If you have many factor variables, you can use the select_if() function; install the dplyr package. It has many functions that subset the data by a condition you set. Use it like this:
categorical <- select_if(df, is.factor)
str(categorical)
Another way could be as follows:
# extracting the numeric columns from the iris dataset
iris[sapply(iris, is.numeric)]
Numerical_variables <- which(sapply(df, is.numeric))
# then extract column names
Names <- names(Numerical_variables)
This doesn't directly answer the question but can be very useful, especially if you want something like all the numeric columns except for your id column and dependent variable.
library(magrittr)  # needed for the %>% and %<>% pipes

numeric_cols <- sapply(dataframe, is.numeric) %>% which %>%
  names %>% setdiff(., c("id_variable", "dep_var"))
dataframe %<>% dplyr::mutate_at(numeric_cols, function(x) your_function(x))