Impute missing data with mean by group - r

I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values on it.
I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A.
I know I can just calculate each group's mean and replace the missing values, but I'm sure there's a more efficient way to do so with loops.
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)
A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE)
Now, I understand I could do the same for group B and C, but perhaps a for loop (with if and else) might do the trick?

library(dplyr)
data %>% group_by(group) %>%
mutate(variable=ifelse(is.na(variable),mean(variable,na.rm=TRUE),variable))
For a faster, base-R version, you can use ave:
data$variable<-ave(data$variable,data$group,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x))
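As a minimal, self-contained sketch of the ave approach: the toy data frame below is made up for illustration (only the column names group and variable come from the question), and impute_group_mean is a hypothetical helper name.

```r
# Hypothetical toy data; 'group' and 'variable' mirror the names in the question
data <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C"),
  variable = c(1, NA, 3, 10, NA, NA, 7)
)

# Replace each NA with the mean of the non-missing values in the same vector
impute_group_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the function within each level of 'group' and
# returns the results in the original row order
data$variable <- ave(data$variable, data$group, FUN = impute_group_mean)
data$variable
# 1 2 3 10 10 7 7
```

Note that a group whose values are all missing would be filled with NaN, since the mean of an empty set is undefined.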

You could use the data.table package to achieve this:
tomean <- c("var1", "var2")
library(data.table)
setDT(dat)
dat[, (tomean) := lapply(tomean, function(x) {
x <- get(x)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}), by = group]  # by = group: impute with the mean of each group, not the overall mean


Subset data frame by factor cardinality?

I suspect that this will be a duplicate, but my efforts to find an answer have failed. Suppose that I have a data frame with columns made entirely of either integers or factors. Some of these columns have factors with many levels and some do not. Suppose that I want to select parts of or otherwise subset the data such that I only get the columns with factors that have fewer than 10 levels. How can I do this? My first thought was to make a particularly nasty sapply command, but I'm hoping for a better way.
We can use select_if
library(dplyr)
df1 %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
With a reproducible example using iris
data(iris)
iris %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
Or using sapply
i1 <- sapply(df1, function(x) is.factor(x) && nlevels(x) < 10)
df1[i1]
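To make the sapply variant concrete, here is a small sketch on a made-up data frame (the column names id, small, and big are assumptions for illustration). It uses vapply, which is like sapply but guarantees a logical vector.

```r
# Hypothetical data: one integer column, one 3-level factor, one 12-level factor
df1 <- data.frame(
  id    = 1:12,
  small = factor(rep(c("x", "y", "z"), 4)),
  big   = factor(letters[1:12])
)

# TRUE only for factor columns with fewer than 10 levels
keep <- vapply(df1, function(col) is.factor(col) && nlevels(col) < 10, logical(1))

names(df1)[keep]
# "small"
df1[keep]  # data frame containing only the 'small' column
```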
With data.table you can do:
library(data.table)
setDT(df)
# is.factor() guards against non-factor columns, whose levels() is NULL (length 0)
df[,.SD, .SDcols = sapply(df, function(x) is.factor(x) && nlevels(x) < 10)]
Example:
df <- data.table(x = factor(1:3, levels = 1:5), y = factor(1:3, levels = 1:10))
df[,.SD, .SDcols = sapply(df, function(x) length(levels(x))>5)]
y
1: 1
2: 2
3: 3

R query - Is it possible to use "sapply" and the "weighted.mean" function together?

I've been using some code to run means for specific variable values (demographic breaks); however, I now have data with a weight variable and need to calculate weighted means. I've already been using code to calculate sample means, and was wondering whether it's possible to change or adjust the function to calculate the weighted mean. Here is some code to generate sample data:
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))
So far I've been using this code to get sample means
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) mean(x, na.rm = TRUE))))
> Gender_Profile_1
sapply.subset.df..gender....1...FUN...function.x..mean.x..na.rm...TRUE..
gender 1.000
agegroup 4.200
attitude_1 4.000
attitude_2 4.000
attitude_3 2.300
income 77274.700
weight 3.016
As you can see it generates Gender_Profile_1 with the means for all variables.
In my attempt to calculate the weighted mean, I've tried to change the "FUN=" part to this
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
I get the following error message
Error in weighted.mean.default(x, w = weight, na.rm = TRUE) :
'x' and 'w' must have the same length
I've been trying all kinds of permutations of df$weight and df$x, but nothing seems to work.
Any help or ideas would be great. Many thanks
Base R
If you want to stick to base R, you can do the following:
# define func to return all weighted means
all_wmeans <- function(data_subset) {
# which cols to summarise? all but gender and weight
summ_cols <- setdiff(names(data_subset), c('gender', 'weight'))
# for each col, calc weighted mean with weights from the 'weight' column
result <- lapply(data_subset[, summ_cols],
weighted.mean, w=data_subset$weight)
# squeeze the resulting list back to a data.frame and return
return(data.frame(result))
}
# now, split the df on gender, and apply the func to each chunk
lapply(split(df, df$gender), all_wmeans)
The result is a list of two data frames, one for each value of gender:
$`1`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.397546 4.027851 3.950597 1.962202 74985.25
$`2`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.092234 3.642666 3.676287 2.388872 64075.23
The fabulous data.table
If you don't mind using packages, dplyr and data.table are great packages that make this kind of stuff much simpler. Here's data.table:
# load library and create a data.table object
library(data.table)
my_dt <- data.table(df)
# now it's a one liner:
my_dt[, lapply(.SD, weighted.mean, w=.SD$weight), by=gender]
which returns:
gender agegroup attitude_1 attitude_2 attitude_3 income weight
1: 2 4.092234 3.642666 3.676287 2.388872 64075.23 4.099426
2: 1 4.397546 4.027851 3.950597 1.962202 74985.25 3.904483
The data.table code also groups the rows by gender, and uses lapply to apply a function and extra argument to each Subset of Data (that's what the .SD call is). Conceptually, it's the exact same as the base R code, just compact and fast.
You can do the whole lot at once like this:
sapply(1:2, function(y)
sapply(subset(df, df$gender == y), function(x)
weighted.mean(x, df$weight[df$gender == y])))
#> [,1] [,2]
#> gender 1.000000 2.000000
#> agegroup 4.397546 4.092234
#> attitude_1 4.027851 3.642666
#> attitude_2 3.950597 3.676287
#> attitude_3 1.962202 2.388872
#> income 74985.247679 64075.232966
#> weight 3.904483 4.099426
I think the main problem with your code is that you are calling the weight column inside the sapply loop, but this column has not been subsetted (as df has), so its length no longer matches the subsetted columns. You could therefore subset the weight column before the sapply call and then loop using those subsetted weights.
Using the code you posted:
weight <- subset(df, gender==1)[,"weight"]
#Exactly the same code you posted
assign("Gender_Profile_2",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
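The length mismatch behind the error can be reproduced on a tiny example. The vectors below are made up for illustration; the point is only that weighted.mean requires x and w to have the same length, so the weights must be subsetted with the same condition as the values.

```r
# Hypothetical values: 5 rows in total, 3 of which belong to the subset
w_all     <- c(1, 2, 3, 4, 5)                   # weights for all rows
in_subset <- c(TRUE, FALSE, TRUE, FALSE, TRUE)  # rows kept by the subset
x_subset  <- c(4, 2, 5)                         # a column after subsetting

# Unsubsetted weights: lengths 3 and 5 differ, so this fails
res <- try(weighted.mean(x_subset, w = w_all), silent = TRUE)
inherits(res, "try-error")
# TRUE

# Weights subsetted the same way as the data: works
wm <- weighted.mean(x_subset, w = w_all[in_subset])
wm
# (4*1 + 2*3 + 5*5) / (1 + 3 + 5) = 35/9, about 3.888889
```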
Here is another solution using apply, that might be easier to implement:
#Apply the desired function by columns
apply(subset(df, gender==1), 2, FUN = function(x) mean(x, na.rm = TRUE))
#Get the weights of the rows that have gender == 1
weight <- subset(df, gender==1)[,7]
#Apply the weighted mean function
apply(subset(df[,-7], gender==1), 2, FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))

Removing duplicate rows from a data frame in R, keeping those with a smaller/larger value

I am trying to remove duplicate rows in an R data frame, but I want the condition that the row with a smaller or larger value (not bothered for the purpose of this question) in a certain column should be kept.
I can remove duplicate rows normally (from either side) like this:
df = data.frame( x = c(1,1,2,3,4,5,5,6,1,2,3,3,4,5,6),
y = c(rnorm(4),NA,rnorm(10)),
id = c(rep(1,8), rep(2,7)))
splitID <- split(df , df$id)
lapply(splitID, function(x) x[!duplicated(x$x),] )
How can I condition the removal of duplicate rows?
Thanks!
Use ave() to return a logical index to subset your data.frame
idx = as.logical(ave(df$y, df$x, df$id, FUN=fun))
df[idx,, drop=FALSE]
Some possible choices for fun include
fun1 = function(x)
!is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))
fun2 = function(x) {
res = logical(length(x))
res[which.min(x)] = TRUE
res
}
The dplyr version of this might be
df %>% group_by(x, id) %>% filter(fun2(y))
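We need to order the rows before applying duplicated, so that the row to keep comes first within each value of x:
lapply(splitID, function(x) {
  x <- x[order(x$x, x$y), ]   # smallest y first within each x
  x[!duplicated(x$x), ]       # keep the first row per x
})
For the reverse, i.e. keeping the larger values, order y in decreasing order instead, e.g. order(x$x, -x$y).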
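The order-then-deduplicate idea can also be written without splitting, by treating id and x together as the key. This is only a sketch on made-up data (fixed y values instead of rnorm so the result is reproducible); the names sorted, smallest, and largest are illustrative.

```r
# Hypothetical data with duplicate (id, x) pairs and one missing y
df <- data.frame(
  x  = c(1, 1, 2, 2, 3),
  y  = c(0.5, -1.2, 3.0, NA, 2.0),
  id = c(1, 1, 1, 1, 2)
)

# Keep the row with the smallest y per (id, x): sort ascending by y,
# then drop every row whose key has already been seen. NA sorts last.
sorted   <- df[order(df$id, df$x, df$y), ]
smallest <- sorted[!duplicated(sorted[c("id", "x")]), ]
smallest$y
# -1.2 3.0 2.0

# Keep the row with the largest y instead: sort descending by y
sorted_desc <- df[order(df$id, df$x, -df$y), ]
largest     <- sorted_desc[!duplicated(sorted_desc[c("id", "x")]), ]
largest$y
# 0.5 3.0 2.0
```

Because order places NA last by default, a row with a missing y is kept only when its key has no non-missing alternative.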

Apply a rolling sum by group in R

I need to calculate a rolling sum by group.
library(zoo)  # provides rollapply()
y<- 1:10
tmp<-data.frame(y)
tmp$roll<-NA
tmp$roll[2:10]<-rollapply (y, 2, sum)
tmp$g<-(c("a","a","a","a","a","b","b","b","b","b"))
tmp$roll calculates the rolling sum for tmp$y; I need to do this by tmp$g. I think I may need to split the data frame into a list of data frames by group and then bind them back together, but this seems like a long route. The result would be an additional column of the rolling sum by group a and b (this is a simplified example of the actual data frame):
roll_group
NA
3
5
7
9
NA
13
15
17
19
Here is the data.table way:
library(data.table)
tmp.dt <- data.table(tmp)
tmp.dt <- tmp.dt[, .(y =y, roll = cumsum(y)), by = g]
You can do it with dplyr package as well.
Thanks but the answers provided in this post use the cumsum whereas I need the rolled sum with NA's if there aren't enough lagged values. I solved it this way:
library(zoo)    # rollsum()
library(dplyr)  # %>%, group_by(), mutate()
#function to calculate rolled sum, returns a column vector
roll<-function(x,lags){
if (length(x)<lags) {
tmp=c(rep(NA,length(x)))
}
else {
tmp=rollsum(x, lags, align = "right", fill = NA)
}
tmp=as.numeric(tmp)
return(tmp)
}
tmp1 <- tmp %>%
group_by(g) %>%
mutate(roll_group = ave(y, g, FUN = function(x) roll(x, 2)))%>%
ungroup
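For comparison, the same grouped rolling sum can be sketched in base R without zoo or dplyr. The helper roll_sum below is a hypothetical name; it builds the window sums from differences of cumulative sums and pads with NA where there are not enough lagged values.

```r
# Same toy data as in the question
tmp <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# Right-aligned rolling sum over a window of k values, NA-padded at the start
roll_sum <- function(x, k) {
  n <- length(x)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(c(0, x))
  # sum of x[(i - k + 1):i] equals cs[i + 1] - cs[i - k + 1]
  c(rep(NA_real_, k - 1), cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)])
}

# ave() applies roll_sum within each group and keeps the original row order
tmp$roll_group <- ave(as.numeric(tmp$y), tmp$g, FUN = function(x) roll_sum(x, 2))
tmp$roll_group
# NA 3 5 7 9 NA 13 15 17 19
```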
How about wrapping it in tapply (or lapply split):
tapply(y, tmp$g, cumsum)
Consider this base solution with sapply() combining running count and running sum:
tmp$roll <- sapply(1:nrow(tmp),
function(i)
sum((tmp[1:i, c("g")] == tmp$g[i]) * tmp[1:i,]$y)
)

Mutate to Create Minimum in Each Row

I have a question relating to creating a minimum value in a new column in dplyr using the mutate function based off two other columns.
The following code repeats the same value for each row in the new column. Is there a way to create an independent minimum for each row in the new column? I wish to avoid using loops or the apply family due to speed and would like to stick with dplyr if possible. Here's the code:
a = data.frame(runif(5,0,5))
b = data.frame(runif(5,0,5))
c = data.frame(runif(5,0,5))
y = cbind(a,b,c)
colnames(y) = c("a","b","c")
y = mutate(y, d = min(y$b, y$c))
y
The new column "d" is simply a repeat of the same number. Any suggestions on how to fix it so that it's the minimum of "b" and "c" in each row?
Thank you for your help.
We can use pmin
y$d <- with(y, pmin(b, c))
Or
transform(y, d = pmin(b,c))
Or with dplyr
library(dplyr)
y %>%
mutate(d = pmin(b,c))
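The difference between min and pmin is easy to see on two short vectors. The values below are made up; c_col is used as a name only to avoid masking the c function.

```r
b     <- c(3, 1, 4)
c_col <- c(2, 5, 0)

# min() pools all arguments and returns one value,
# which mutate() then recycles down the whole column
overall_min <- min(b, c_col)
overall_min
# 0

# pmin() compares element by element and returns one value per row
rowwise_min <- pmin(b, c_col)
rowwise_min
# 2 1 0

# With missing values, na.rm = TRUE ignores the NA in each position
pmin(c(1, NA), c(2, 3), na.rm = TRUE)
# 1 3
```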
min reduces all of its arguments to a single value rather than working row by row. If we still want to use min, one option is rowwise:
y %>%
rowwise %>%
mutate(d = min(unlist(c(b,c))))
You could make the min function apply by rows rather than columns by using the apply function and setting the margin argument to MARGIN = 1. Your rowwise min function would look like this:
apply(y, MARGIN = 1, FUN = function(x) min(x))
Then, in order to make the rowwise min function only apply to columns b and c, you could use the select function within mutate, like this:
y %>% mutate(b.c.min =
y %>%
select(one_of("b", "c")) %>%
apply(MARGIN = 1, FUN = function(x) min(x)))
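If the set of columns is held in a character vector, a vectorised alternative to apply is to pass those columns to pmin through do.call. This is a sketch on made-up values, with cols as a hypothetical variable name.

```r
# Hypothetical data with fixed values so the result is reproducible
y <- data.frame(a = c(1, 9, 4), b = c(3, 1, 4), c = c(2, 5, 0))

# Columns to take the row-wise minimum over
cols <- c("b", "c")

# A data frame is a list of columns, so do.call() passes each selected
# column to pmin() as a separate argument
y$d <- do.call(pmin, y[cols])
y$d
# 2 1 0
```

This stays vectorised however many columns are listed in cols, whereas apply with MARGIN = 1 first converts the data frame to a matrix and then loops over its rows.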
